We decided to work on a dataset about terrorism because we believe that nowadays it is a topic of concern for many people in many countries. We found a huge dataset published and regularly updated by the University of Maryland. Here is a first glimpse of what it looks like:
library(ggplot2)
library(raster)
## Loading required package: sp
library(readr)
setwd("C:/Users/felip/Desktop/Análisis de Datos")
globalterrorism <- read_csv("globalterrorismdb_0718dist.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## approxdate = col_character(),
## resolution = col_character(),
## country_txt = col_character(),
## region_txt = col_character(),
## provstate = col_character(),
## city = col_character(),
## location = col_character(),
## summary = col_character(),
## alternative_txt = col_character(),
## attacktype1_txt = col_character(),
## attacktype2_txt = col_character(),
## attacktype3 = col_logical(),
## attacktype3_txt = col_logical(),
## targtype1_txt = col_character(),
## targsubtype1_txt = col_character(),
## corp1 = col_character(),
## target1 = col_character(),
## natlty1_txt = col_character(),
## targtype2_txt = col_character(),
## targsubtype2_txt = col_character()
## # ... with 48 more columns
## )
## See spec(...) for full column specifications.
## Warning: 7113 parsing failures.
## row col expected actual file
## 1687 ransomamtus 1/0/T/F/TRUE/FALSE 20000 'globalterrorismdb_0718dist.csv'
## 1687 ransomnote 1/0/T/F/TRUE/FALSE A note, with a sticker of the Black Liberation Army, demanded $20,000 for the safe return of the bartender. However, the perpetrators left the bartender alone and he was able to escape. 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3 1/0/T/F/TRUE/FALSE 2 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3_txt 1/0/T/F/TRUE/FALSE Armed Assault 'globalterrorismdb_0718dist.csv'
## 3432 ransomnote 1/0/T/F/TRUE/FALSE Perpetrators demanded support by the Dutch in the UN for independence for the Moluccan Islands. 'globalterrorismdb_0718dist.csv'
## .... ............... .................. .......................................................................................................................................................................................... ................................
## See problems(...) for more details.
head(globalterrorism)
## # A tibble: 6 x 135
## eventid iyear imonth iday approxdate extended resolution country
## <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 1.97e11 1970 7 2 <NA> 0 <NA> 58
## 2 1.97e11 1970 0 0 <NA> 0 <NA> 130
## 3 1.97e11 1970 1 0 <NA> 0 <NA> 160
## 4 1.97e11 1970 1 0 <NA> 0 <NA> 78
## 5 1.97e11 1970 1 0 <NA> 0 <NA> 101
## 6 1.97e11 1970 1 1 <NA> 0 <NA> 217
## # ... with 127 more variables: country_txt <chr>, region <dbl>,
## # region_txt <chr>, provstate <chr>, city <chr>, latitude <dbl>,
## # longitude <dbl>, specificity <dbl>, vicinity <dbl>, location <chr>,
## # summary <chr>, crit1 <dbl>, crit2 <dbl>, crit3 <dbl>, doubtterr <dbl>,
## # alternative <dbl>, alternative_txt <chr>, multiple <dbl>,
## # success <dbl>, suicide <dbl>, attacktype1 <dbl>,
## # attacktype1_txt <chr>, attacktype2 <dbl>, attacktype2_txt <chr>,
## # attacktype3 <lgl>, attacktype3_txt <lgl>, targtype1 <dbl>,
## # targtype1_txt <chr>, targsubtype1 <dbl>, targsubtype1_txt <chr>,
## # corp1 <chr>, target1 <chr>, natlty1 <dbl>, natlty1_txt <chr>,
## # targtype2 <dbl>, targtype2_txt <chr>, targsubtype2 <dbl>,
## # targsubtype2_txt <chr>, corp2 <chr>, target2 <chr>, natlty2 <dbl>,
## # natlty2_txt <chr>, targtype3 <dbl>, targtype3_txt <chr>,
## # targsubtype3 <dbl>, targsubtype3_txt <chr>, corp3 <chr>,
## # target3 <chr>, natlty3 <dbl>, natlty3_txt <chr>, gname <chr>,
## # gsubname <chr>, gname2 <chr>, gsubname2 <lgl>, gname3 <lgl>,
## # gsubname3 <lgl>, motive <chr>, guncertain1 <dbl>, guncertain2 <dbl>,
## # guncertain3 <lgl>, individual <dbl>, nperps <dbl>, nperpcap <dbl>,
## # claimed <dbl>, claimmode <dbl>, claimmode_txt <chr>, claim2 <dbl>,
## # claimmode2 <lgl>, claimmode2_txt <lgl>, claim3 <lgl>,
## # claimmode3 <lgl>, claimmode3_txt <lgl>, compclaim <lgl>,
## # weaptype1 <dbl>, weaptype1_txt <chr>, weapsubtype1 <dbl>,
## # weapsubtype1_txt <chr>, weaptype2 <dbl>, weaptype2_txt <chr>,
## # weapsubtype2 <dbl>, weapsubtype2_txt <chr>, weaptype3 <dbl>,
## # weaptype3_txt <chr>, weapsubtype3 <dbl>, weapsubtype3_txt <chr>,
## # weaptype4 <lgl>, weaptype4_txt <lgl>, weapsubtype4 <lgl>,
## # weapsubtype4_txt <lgl>, weapdetail <chr>, nkill <dbl>, nkillus <dbl>,
## # nkillter <dbl>, nwound <dbl>, nwoundus <dbl>, nwoundte <dbl>,
## # property <dbl>, propextent <dbl>, propextent_txt <chr>,
## # propvalue <dbl>, ...
dim(globalterrorism)
## [1] 181691 135
colnames(globalterrorism)
## [1] "eventid" "iyear" "imonth"
## [4] "iday" "approxdate" "extended"
## [7] "resolution" "country" "country_txt"
## [10] "region" "region_txt" "provstate"
## [13] "city" "latitude" "longitude"
## [16] "specificity" "vicinity" "location"
## [19] "summary" "crit1" "crit2"
## [22] "crit3" "doubtterr" "alternative"
## [25] "alternative_txt" "multiple" "success"
## [28] "suicide" "attacktype1" "attacktype1_txt"
## [31] "attacktype2" "attacktype2_txt" "attacktype3"
## [34] "attacktype3_txt" "targtype1" "targtype1_txt"
## [37] "targsubtype1" "targsubtype1_txt" "corp1"
## [40] "target1" "natlty1" "natlty1_txt"
## [43] "targtype2" "targtype2_txt" "targsubtype2"
## [46] "targsubtype2_txt" "corp2" "target2"
## [49] "natlty2" "natlty2_txt" "targtype3"
## [52] "targtype3_txt" "targsubtype3" "targsubtype3_txt"
## [55] "corp3" "target3" "natlty3"
## [58] "natlty3_txt" "gname" "gsubname"
## [61] "gname2" "gsubname2" "gname3"
## [64] "gsubname3" "motive" "guncertain1"
## [67] "guncertain2" "guncertain3" "individual"
## [70] "nperps" "nperpcap" "claimed"
## [73] "claimmode" "claimmode_txt" "claim2"
## [76] "claimmode2" "claimmode2_txt" "claim3"
## [79] "claimmode3" "claimmode3_txt" "compclaim"
## [82] "weaptype1" "weaptype1_txt" "weapsubtype1"
## [85] "weapsubtype1_txt" "weaptype2" "weaptype2_txt"
## [88] "weapsubtype2" "weapsubtype2_txt" "weaptype3"
## [91] "weaptype3_txt" "weapsubtype3" "weapsubtype3_txt"
## [94] "weaptype4" "weaptype4_txt" "weapsubtype4"
## [97] "weapsubtype4_txt" "weapdetail" "nkill"
## [100] "nkillus" "nkillter" "nwound"
## [103] "nwoundus" "nwoundte" "property"
## [106] "propextent" "propextent_txt" "propvalue"
## [109] "propcomment" "ishostkid" "nhostkid"
## [112] "nhostkidus" "nhours" "ndays"
## [115] "divert" "kidhijcountry" "ransom"
## [118] "ransomamt" "ransomamtus" "ransompaid"
## [121] "ransompaidus" "ransomnote" "hostkidoutcome"
## [124] "hostkidoutcome_txt" "nreleased" "addnotes"
## [127] "scite1" "scite2" "scite3"
## [130] "dbsource" "INT_LOG" "INT_IDEO"
## [133] "INT_MISC" "INT_ANY" "related"
As we can see, the dataset is pretty large in terms of both rows and columns. Looking more closely at the columns, it is clear that there are a lot of numerical and categorical features. This would be particularly useful for classification, since it is possible to use the country, the region, the year, the target of the attack and many other variables as labels. Performing clustering on this dataset would also be particularly interesting, as we could discover hidden relationships in the data. There is also a great opportunity to implement and compare different dimensionality reduction techniques. Moreover, something that could be both useful and interesting would be making predictions about where and when an event with certain characteristics could happen.
Coming now to the objectives of the analysis, our main questions are:
- in which countries is the fear of terrorism justified?
- is it possible to recognize patterns characterizing certain regions/states/years?
- is it possible to predict some features of the terrorist events in the following year?
The first thing we wanted to do was to visualize the trend for each year concerning the number of events occurring:
yearly <- rep(0, length(unique(globalterrorism$iyear)))
for( i in 1:length(unique(globalterrorism$iyear))){
yearly[i] <- nrow(globalterrorism[which(globalterrorism$iyear==unique(globalterrorism$iyear)[i]),])
}
yearly <- data.frame(cbind(unique(globalterrorism$iyear),yearly))
colnames(yearly) <- c("year","count")
ggplot(yearly,aes(x=year,y=count)) + geom_point(size=3.5) + geom_line(size=1)
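As a side note, the same yearly counts can be obtained more compactly with table(); this sketch is equivalent to the loop above:
# Sketch: equivalent, more compact way to count events per year
yearly_alt <- as.data.frame(table(globalterrorism$iyear))
colnames(yearly_alt) <- c("year", "count")
yearly_alt$year <- as.numeric(as.character(yearly_alt$year))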
It is clear that the number of events per year has generally been increasing (especially in the last decade). Let us now see whether the total number of deaths per year has also increased:
yearly_deaths <- aggregate(nkill ~ iyear, globalterrorism, FUN=sum)
ggplot(yearly_deaths, aes(x=iyear,y=nkill)) + geom_point(size=3.5) + geom_line(size=1)
What immediately caught our attention is the different scale for events and deaths. It shows that the number of events is generally greater than the number of deaths. We will have to pay attention to this, because the concept of "terrorist attack" also covers very small events. In any case, we see that the number of deaths per year is also increasing.
Next we focused on the total number of events for each country:
stately <- rep(0, length(unique(globalterrorism$country_txt)))
for( i in 1:length(unique(globalterrorism$country_txt))){
stately[i] <- nrow(globalterrorism[which(globalterrorism$country_txt==unique(globalterrorism$country_txt)[i]),])
}
stately <- data.frame(cbind(unique(globalterrorism$country_txt),stately))
colnames(stately) <- c("country_txt","count")
stately$count <- as.numeric(as.character(stately$count))
stately_ord <- stately[order(stately$count,decreasing=TRUE),]
stately_ord_15 <- stately_ord[1:15,]
ggplot(stately_ord_15,aes(x=reorder(country_txt,count),y=count)) +
geom_bar(stat = "identity") +
xlab("Country") +
coord_flip()
The natural continuation of the last question is to consider the total number of deaths for each country:
state_deaths <- aggregate(nkill ~ country_txt, globalterrorism, FUN=sum)
state_deaths_ord <- state_deaths[order(state_deaths$nkill,decreasing = TRUE),]
state_deaths_15 <- state_deaths_ord[1:15,]
ggplot(state_deaths_15,aes(x=reorder(country_txt,nkill),y=nkill)) +
geom_bar(stat="identity") +
xlab("Country") +
ylab("Deaths") +
coord_flip()
A noteworthy feature of the two plots we just showed is that, for some countries, a high number of events does not necessarily mean a high number of people killed. For this reason we decided to compute the deaths/events ratio for every country:
ratio <- merge(stately,state_deaths,by = "country_txt")
ratio <- data.frame(country_txt = ratio$country_txt, r = ratio$nkill/ratio$count)
ratio_ord <- ratio[order(ratio$r,decreasing = TRUE),]
ratio_15 <- ratio_ord[1:15,]
ggplot(ratio_15,aes(x=reorder(country_txt,r),y=r)) +
geom_bar(stat="identity") +
xlab("Country") +
ylab("deaths/events") +
coord_flip()
state_deaths_events <- merge(stately,state_deaths,by = "country_txt")
ggplot(state_deaths_events[which(state_deaths_events$count<10000),]) +
geom_point(aes(x=count,y=nkill)) +
xlab("number of events") +
ylab("deaths") +
geom_abline(intercept = 0, slope = 1, col = "red")
These plots confirm what we suspected: in some countries the events are fewer but more severe. In the scatter plot it is easy to see that many points lie in the region with more deaths than events. This suggests we should be careful with our future hypotheses and conclusions.
Another significant piece of information is the presence of many African countries in the last three plots.
Similarly to the analysis just presented, we repeated the same steps considering macro-regions:
regionly <- rep(0, length(unique(globalterrorism$region_txt)))
for( i in 1:length(unique(globalterrorism$region_txt))){
regionly[i] <- nrow(globalterrorism[which(globalterrorism$region_txt==unique(globalterrorism$region_txt)[i]),])
}
regionly <- data.frame(cbind(unique(globalterrorism$region_txt),regionly))
colnames(regionly) <- c("region_txt","count")
regionly$count <- as.numeric(as.character(regionly$count))
ggplot(regionly,aes(x=reorder(region_txt,count),y=count)) +
geom_bar(stat = "identity") +
xlab("Country") +
coord_flip()
region_deaths <- aggregate(nkill ~ region_txt, globalterrorism, FUN=sum)
ggplot(region_deaths,aes(x=reorder(region_txt,nkill),y=nkill)) +
geom_bar(stat="identity") +
xlab("Country") +
ylab("Deaths") +
coord_flip()
ratio_reg <- merge(regionly,region_deaths,by = "region_txt")
ratio_reg <- data.frame(region_txt = ratio_reg$region_txt, r = ratio_reg$nkill/ratio_reg$count)
ggplot(ratio_reg,aes(x=reorder(region_txt,r),y=r)) +
geom_bar(stat="identity") +
xlab("Country") +
ylab("deaths/events") +
coord_flip()
What is very surprising is that, although Middle East & North Africa and South Asia are the regions with the highest numbers of events and deaths, the region with the highest ratio is Sub-Saharan Africa.
To get a better understanding of the geographical distribution of the data, we then decided to plot, for each decade, the locations and the number of deaths of the events of type "Bombing/Explosion" (the attack type most commonly associated with terrorism, present in more than half of the events):
worldmap <- shapefile("TM_WORLD_BORDERS-0.3.shp")
worldmap_df <- fortify(worldmap)
## Regions defined for each Polygons
#### Filter by event type
globalterrorism$attacktype1_txt <- as.factor(globalterrorism$attacktype1_txt)
levels(globalterrorism$attacktype1_txt)
## [1] "Armed Assault"
## [2] "Assassination"
## [3] "Bombing/Explosion"
## [4] "Facility/Infrastructure Attack"
## [5] "Hijacking"
## [6] "Hostage Taking (Barricade Incident)"
## [7] "Hostage Taking (Kidnapping)"
## [8] "Unarmed Assault"
## [9] "Unknown"
#Bombing/Explosion
bombing <- globalterrorism[which(globalterrorism$attacktype1_txt == "Bombing/Explosion"),]
bombing_70_80 <- bombing[which(bombing$iyear>=1970 & bombing$iyear<1980),]
bombing_80_90 <- bombing[which(bombing$iyear>=1980 & bombing$iyear<1990),]
bombing_90_00 <- bombing[which(bombing$iyear>=1990 & bombing$iyear<2000),]
bombing_00_10 <- bombing[which(bombing$iyear>=2000 & bombing$iyear<2010),]
bombing_10_20 <- bombing[which(bombing$iyear>=2010 & bombing$iyear<2020),]
source("plot_world.R")
plot_world(worldmap_df,bombing_70_80,"Bombing 70/80")
plot_world(worldmap_df,bombing_80_90,"Bombing 80/90")
plot_world(worldmap_df,bombing_90_00,"Bombing 90/00")
plot_world(worldmap_df,bombing_00_10,"Bombing 00/10")
plot_world(worldmap_df,bombing_10_20,"Bombing 10/20")
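The plotting itself is delegated to plot_world(), defined in the external file plot_world.R, which is not reproduced in this report. As a rough, assumed sketch of what such a function might look like (overlaying event locations on the world borders, with point size given by nkill):
# Hypothetical sketch of plot_world(); the actual implementation lives in plot_world.R
plot_world <- function(worldmap_df, events, title) {
  ggplot() +
    geom_polygon(data = worldmap_df,
                 aes(x = long, y = lat, group = group),
                 fill = "grey90", colour = "grey60") +
    geom_point(data = events,
               aes(x = longitude, y = latitude, size = nkill),
               colour = "red", alpha = 0.3) +
    ggtitle(title)
}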
Similarly, we explored the events of type "Facility/Infrastructure Attack":
FI_A <- globalterrorism[which(globalterrorism$attacktype1_txt == "Facility/Infrastructure Attack"),]
FI_A_70_80 <- FI_A[which(FI_A$iyear>=1970 & FI_A$iyear<1980),]
FI_A_80_90 <- FI_A[which(FI_A$iyear>=1980 & FI_A$iyear<1990),]
FI_A_90_00 <- FI_A[which(FI_A$iyear>=1990 & FI_A$iyear<2000),]
FI_A_00_10 <- FI_A[which(FI_A$iyear>=2000 & FI_A$iyear<2010),]
FI_A_10_20 <- FI_A[which(FI_A$iyear>=2010 & FI_A$iyear<2020),]
plot_world(worldmap_df,FI_A_70_80,"Facility/Infrastructure Attack 70/80")
plot_world(worldmap_df,FI_A_80_90,"Facility/Infrastructure Attack 80/90")
plot_world(worldmap_df,FI_A_90_00,"Facility/Infrastructure Attack 90/00")
plot_world(worldmap_df,FI_A_00_10,"Facility/Infrastructure Attack 00/10")
plot_world(worldmap_df,FI_A_10_20,"Facility/Infrastructure Attack 10/20")
All these geographical considerations are useful to get a general idea of the dataset and to better visualize some important facts contained in it.
After having explored the data in a geographical context, we felt the need to better understand other characteristics of the events, such as the attack type, the target and the weapon used.
### ATTACK TYPE
attack <- rep(0, length(unique(globalterrorism$attacktype1_txt)))
for( i in 1:length(unique(globalterrorism$attacktype1_txt))){
attack[i] <- nrow(globalterrorism[which(globalterrorism$attacktype1_txt==unique(globalterrorism$attacktype1_txt)[i]),])
}
attack <- data.frame(attack_type = unique(globalterrorism$attacktype1_txt),count = attack)
colnames(attack) <- c("attacktype1_txt","count")
attack$count <- as.numeric(as.character(attack$count))
attack_ord <- attack[order(attack$count,decreasing=TRUE),]
ggplot(attack_ord,aes(x=reorder(attacktype1_txt,count),y=count)) +
geom_bar(stat = "identity") +
xlab("Attack type") +
coord_flip()
### Ratio
attack_deaths <- aggregate(nkill ~ attacktype1_txt, globalterrorism, FUN=sum)
ratio_att <- merge(attack,attack_deaths,by = "attacktype1_txt")
ratio_att <- data.frame(attacktype1_txt = ratio_att$attacktype1_txt, r = ratio_att$nkill/ratio_att$count)
ggplot(ratio_att,aes(x=reorder(attacktype1_txt,r),y=r)) +
geom_bar(stat = "identity") +
xlab("Attack type") +
ylab("deaths/events") +
coord_flip()
### Box-plot
ggplot(globalterrorism,aes(x=attacktype1_txt,y=nkill)) +
geom_boxplot() +
coord_flip()
## Warning: Removed 10313 rows containing non-finite values (stat_boxplot).
ggplot(globalterrorism,aes(x=attacktype1_txt,y=nkill)) +
geom_boxplot() +
ylim(0,25) +
coord_flip()
## Warning: Removed 12624 rows containing non-finite values (stat_boxplot).
##### TARGET
target <- rep(0, length(unique(globalterrorism$targtype1_txt)))
for( i in 1:length(unique(globalterrorism$targtype1_txt))){
target[i] <- nrow(globalterrorism[which(globalterrorism$targtype1_txt==unique(globalterrorism$targtype1_txt)[i]),])
}
target <- data.frame(target_type = unique(globalterrorism$targtype1_txt), count = target)
colnames(target) <- c("targtype1_txt","count")
target$count <- as.numeric(as.character(target$count))
target_ord <- target[order(target$count,decreasing=TRUE),]
ggplot(target_ord,aes(x=reorder(targtype1_txt,count),y=count)) +
geom_bar(stat = "identity") +
xlab("Target") +
coord_flip()
### Ratio
target_deaths <- aggregate(nkill ~ targtype1_txt, globalterrorism, FUN=sum)
ratio_trg <- merge(target,target_deaths,by = "targtype1_txt")
ratio_trg <- data.frame(targtype1_txt = ratio_trg$targtype1_txt, r = ratio_trg$nkill/ratio_trg$count)
ggplot(ratio_trg,aes(x=reorder(targtype1_txt,r),y=r)) +
geom_bar(stat = "identity") +
xlab("Target type") +
ylab("deaths/events") +
coord_flip()
###Boxplot
ggplot(globalterrorism,aes(x=targtype1_txt,y=nkill)) +
geom_boxplot() +
coord_flip()
## Warning: Removed 10313 rows containing non-finite values (stat_boxplot).
ggplot(globalterrorism,aes(x=targtype1_txt,y=nkill)) +
geom_boxplot() +
ylim(0,25) +
coord_flip()
## Warning: Removed 12624 rows containing non-finite values (stat_boxplot).
#### WEAPON
weapon <- rep(0, length(unique(globalterrorism$weaptype1_txt)))
for( i in 1:length(unique(globalterrorism$weaptype1_txt))){
weapon[i] <- nrow(globalterrorism[which(globalterrorism$weaptype1_txt==unique(globalterrorism$weaptype1_txt)[i]),])
}
weapon <- data.frame(weapon_type = unique(globalterrorism$weaptype1_txt), count = weapon)
colnames(weapon) <- c("weaptype1_txt","count")
weapon$count <- as.numeric(as.character(weapon$count))
weapon_ord <- weapon[order(weapon$count,decreasing=TRUE),]
ggplot(weapon_ord,aes(x=reorder(weaptype1_txt,count),y=count)) +
geom_bar(stat = "identity") +
xlab("Weapon") +
coord_flip()
### Ratio
weapon_deaths <- aggregate(nkill ~ weaptype1_txt, globalterrorism, FUN=sum)
ratio_weap <- merge(weapon,weapon_deaths,by = "weaptype1_txt")
ratio_weap <- data.frame(weaptype1_txt = ratio_weap$weaptype1_txt, r = ratio_weap$nkill/ratio_weap$count)
ggplot(ratio_weap,aes(x=reorder(weaptype1_txt,r),y=r)) +
geom_bar(stat = "identity") +
xlab("weapon type") +
ylab("deaths/events") +
coord_flip()
###Boxplot
ggplot(globalterrorism,aes(x=weaptype1_txt,y=nkill)) +
geom_boxplot() +
coord_flip()
## Warning: Removed 10313 rows containing non-finite values (stat_boxplot).
ggplot(globalterrorism,aes(x=weaptype1_txt,y=nkill)) +
geom_boxplot() +
ylim(0,25) +
coord_flip()
## Warning: Removed 12624 rows containing non-finite values (stat_boxplot).
What we can extract from this last part of our analysis is that the phenomenon is very complex and has several aspects that must be considered when studying it.
For example, taking the attack type into account, we become aware of the fact that although hijackings are last in the number of events, they are first in the deaths/events ratio. The same happens for kidnappings, while, surprisingly, bombings have a "lower" ratio. This suggests that certain types of events occur very often but do not have a high number of victims. Another fact worth noticing is that, among the weapons used, vehicles have the highest ratio but a very low number of occurrences.
Regarding the boxplots, it is worth highlighting that, for every attack type, target type and weapon type, at least half of the events have a low number of deaths; the events with a "high" number of deaths represent at most the upper quarter of the distribution.
Finally, we think this is a very rich dataset in which to discover and visualize many things. We can learn a lot while preparing ourselves for more advanced procedures and methods, and we can also learn how to manage a huge dataset and extract useful information from it. The possibility of deriving new columns, as we did for the ratios, should also not be underestimated.
After a meeting with the teaching assistants, we considered the idea of switching to a dataset more coherent with the course objectives, i.e., one with an explicit question to answer via prediction or classification. However, the feedback we received from the teacher during our first presentation gave us new ideas to keep working with this dataset. Two of the main questions to address are:
- Present a clustering of the different kinds of terrorist events.
- Predict some features of future terrorist events (location, number of deaths, etc.).
To achieve this, we have further work to do on the terrorism dataset.
After having explored the data and discovered interesting facts such as the trends concerning attacks, weapons, targets and countries, we decided to go deeper in our analysis. In this part of the work we are going to perform classification and clustering. Concerning classification, we will try to create mainly two models: the first able to predict the attack type given certain attributes, and the second able to predict the region given the features. In contrast, with clustering we will try to discover unknown relationships in the data, such as a natural number k of groups. Afterwards we will validate the clustering by comparing it against clusters of random noise (comparing the correlation between the incidence and distance matrices, as we will show later in the report).
## Data pre-processing
In this part we prepare the data for the whole analysis. Mainly this means dealing with NAs and selecting the relevant features we are interested in. First we filtered the data, eliminating the rows that were not considered with certainty to be terrorist attacks. Then we selected the features of interest and created a new binary variable to distinguish between events with and without deaths. Moreover, we eliminated the observations whose number of deaths or wounded was extreme enough to be considered an outlier. The rest is explained pretty clearly in the code below:
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:raster':
##
## intersect, select, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(123)
data <-read_csv("globalterrorismdb_0718dist.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## approxdate = col_character(),
## resolution = col_character(),
## country_txt = col_character(),
## region_txt = col_character(),
## provstate = col_character(),
## city = col_character(),
## location = col_character(),
## summary = col_character(),
## alternative_txt = col_character(),
## attacktype1_txt = col_character(),
## attacktype2_txt = col_character(),
## attacktype3 = col_logical(),
## attacktype3_txt = col_logical(),
## targtype1_txt = col_character(),
## targsubtype1_txt = col_character(),
## corp1 = col_character(),
## target1 = col_character(),
## natlty1_txt = col_character(),
## targtype2_txt = col_character(),
## targsubtype2_txt = col_character()
## # ... with 48 more columns
## )
## See spec(...) for full column specifications.
## Warning: 7113 parsing failures.
## row col expected actual file
## 1687 ransomamtus 1/0/T/F/TRUE/FALSE 20000 'globalterrorismdb_0718dist.csv'
## 1687 ransomnote 1/0/T/F/TRUE/FALSE A note, with a sticker of the Black Liberation Army, demanded $20,000 for the safe return of the bartender. However, the perpetrators left the bartender alone and he was able to escape. 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3 1/0/T/F/TRUE/FALSE 2 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3_txt 1/0/T/F/TRUE/FALSE Armed Assault 'globalterrorismdb_0718dist.csv'
## 3432 ransomnote 1/0/T/F/TRUE/FALSE Perpetrators demanded support by the Dutch in the UN for independence for the Moluccan Islands. 'globalterrorismdb_0718dist.csv'
## .... ............... .................. .......................................................................................................................................................................................... ................................
## See problems(...) for more details.
dim(data)
## [1] 181691 135
# doubtterr is a binary variable describing the doubts about the "reliability" of the event
df <- data %>% filter(doubtterr == 0 & nkill >= 0)
rm(data)
df <- df %>%
dplyr::select(
# Time
iyear, # Year
imonth, # Month
iday, # Day
# Geospatial
city, # City
latitude, # Geo coordinate
longitude, # Geo coordinate
country, # Country id
region, # Region id, see docs for details
# Numerical
nperps, # Number of perpetrators
nkill, # Death toll
nwound, # No of casualties
nkillter, # Death toll - terrorist only
# Binary
vicinity, # Event occured in city or next to it, binary
ishostkid, # Any hostages ?
ransom, # Any ransom demanded ?
extended, # Event duration above 24hrs
# Categorical
country_txt, # Desc country
region_txt, # Desc region
attacktype1_txt, # Desc attack type
weaptype1_txt, # Desc weapon type
targtype1_txt, # Desc target type
propextent, # Amount of damage done
# Text
summary, # Motive text variable
gname) # Name of organization
long_name <- "Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)"
df$weaptype1_txt <- as.character(df$weaptype1_txt)
df$weaptype1_txt[df$weaptype1_txt == long_name] <- "Vehicle"
df$weaptype1_txt <- as.factor(df$weaptype1_txt)
rm(long_name)
long_name <- "Explosives/Bombs/Dynamite"
df$weaptype1_txt <- as.character(df$weaptype1_txt)
df$weaptype1_txt[df$weaptype1_txt == long_name] <- "Explosives"
df$weaptype1_txt <- as.factor(df$weaptype1_txt)
rm(long_name)
### More reduced data
ml_df <- df %>%
# Initial filter
dplyr::filter(nkill >= 0 &
nwound >= 0 &
(longitude >= -180 & longitude <= 200) &
(latitude >= -90 & latitude <= 90) &
(imonth >= 1 & imonth <= 12)) %>%
# Remove outliers
dplyr::filter( !(abs(nkill-mean(nkill))/sd(nkill) > 4)) %>%
dplyr::filter( !(abs(nwound-mean(nwound))/sd(nwound) > 4)) %>%
# Clean dataset
dplyr::mutate(
# Fill with median
#nperps = ifelse(is.na(nperps) | nperps < 1, perp_median, nperps),
# Check binary variables
vicinity = ifelse((vicinity != 0 & vicinity != 1) | is.na(vicinity), 0, vicinity),
extended = ifelse((extended != 0 & extended != 1) | is.na(extended), 0, extended),
ransom = ifelse((ransom != 0 & ransom != 1) | is.na(ransom), 0, ransom),
ishostkid = ifelse((ishostkid != 0 & ishostkid != 1) | is.na(ishostkid), 0, ishostkid),
propextent = ifelse( (propextent < 1 | propextent > 4) | is.na(propextent), 4, propextent)) %>%
# Feature engineering
dplyr::mutate(
# Group known
gknow = ifelse(gname != "Unknown", 1, 0),
# Scale coordinates
lat = as.vector(scale(latitude)),
long = as.vector(scale(longitude))) %>%
# Create binary target
dplyr::mutate(target = ifelse(round(nkill, 0) > 0, 1, 0)) %>%
# Recode vars, scale numericals, convert cat's to factors.
dplyr::mutate(month = factor(imonth),
cnty = factor(country_txt),
rgn = factor(region_txt),
vnity = factor(vicinity),
ext = factor(extended),
rans = factor(ransom),
host = factor(ishostkid),
atype = factor(attacktype1_txt),
wtype = factor(weaptype1_txt),
ttype = factor(targtype1_txt),
gknow = factor(gknow),
prop = factor(propextent),
trgt = factor(target),
lat = as.vector(scale(latitude)),
long = as.vector(scale(longitude))) %>%
# Select clean data
dplyr::select(rgn, vnity, ext, rans, host, atype, wtype, ttype,
prop, target)
As explained in the introduction, we first try to create a model to predict the attack type of an event based on the features we previously selected, such as: - region - weapon - target - killed (yes or no) - …
To do that, in this analysis we use the Naive Bayes and Decision Tree algorithms (and later an SVM), as the following lines of code show:
#### CLASSIFICATION
library(caret)
## Loading required package: lattice
trainIndex=createDataPartition(ml_df$atype, p=0.7)$Resample1
train=ml_df[trainIndex, ]
test=ml_df[-trainIndex, ]
printALL=function(model,train_class,test_class){
trainPred=predict(model, newdata = train, type = "class")
trainTable=table(train_class, trainPred)
testPred=predict(model, newdata=test, type="class")
testTable=table(test_class, testPred)
trainAcc=(sum(diag(trainTable)))/sum(trainTable)
testAcc=(sum(diag(testTable)))/sum(testTable)
message("Accuracy")
print(round(cbind(trainAccuracy=trainAcc, testAccuracy=testAcc),3))
}
# Naive Bayes
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:raster':
##
## interpolate
NBclassifier=naiveBayes(atype ~ . , data=train, laplace = 1,na.action = na.omit)
printALL(NBclassifier,train$atype,test$atype)
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.855 0.855
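Beyond overall accuracy, one could also inspect a per-class confusion matrix on the test set; a small sketch using caret (not part of the original output):
# Sketch: per-class confusion matrix for the Naive Bayes model on the test set
nbTestPred <- predict(NBclassifier, newdata = test, type = "class")
caret::confusionMatrix(nbTestPred, test$atype)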
# Decision Trees
library(rpart)
library(rpart.plot)
DTclassifier <- rpart(atype ~ . , data=train, method = 'class')
rpart.plot(DTclassifier)
printALL(DTclassifier,train$atype,test$atype)
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.85 0.849
#SVM
library(e1071)
small_train <- train[sample(1:10000),,drop=FALSE] # random permutation of the first 10,000 rows of train, used as a smaller set to speed up SVM fitting
SVM_Lin_Classifier <- svm(atype~., small_train, method="C-classification", kernel="linear")
printALL(SVM_Lin_Classifier,train$atype,test$atype)
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.773 0.772
SVM_Pol_Classifier <- svm(atype~., small_train, method="C-classification", kernel="polynomial")
printALL(SVM_Pol_Classifier,train$atype,test$atype)
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.56 0.56
As we can see, the first two classification methods work pretty well, giving an accuracy of 85% on both the training and the test set. Particularly interesting is the decision tree plot, which shows exactly how the algorithm assigns a class to a given observation. As for the Naive Bayes algorithm, the interesting quantities are the prior probabilities and the conditional probabilities given the target class. With these two ingredients one can compute the (unnormalized) posterior probabilities for the target class. For example, the numerator of Bayes' rule for an attack being a Bombing/Explosion given that it took place in the Middle East & North Africa is p(ME&NA|B/E)p(B/E) = 0.36 x 0.55 = 0.2. Another example: for an event being a Facility/Infrastructure Attack given that there are no deaths (target = 0), the product is p(target = 0|F/I A)p(F/I A) = 0.95 x 0.05 = 0.0475. We clearly see how the likelihood (what we actually observe in the data) is weighted by the prior to obtain the posterior: a 95% likelihood can be turned, thanks to Bayes' rule, into a relatively small posterior once weighted by the prior. The last example we want to highlight, in order to underline the importance of using both probabilities, is the following: for a Bombing/Explosion given that it occurred inside a city (vnity = 0), the product is 0.94 x 0.55 = 0.52.
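As an illustration, here is a minimal sketch (not part of the original analysis) of how these quantities can be read off the fitted e1071 naiveBayes object; the level names are taken from the dataset shown above:
# Sketch: inspect the fitted Naive Bayes components and recompute one of the products above
prior <- NBclassifier$apriori / sum(NBclassifier$apriori)  # class priors p(attack type)
cond_rgn <- NBclassifier$tables$rgn                        # p(region | attack type)
# Numerator of Bayes' rule for "Bombing/Explosion" given region "Middle East & North Africa";
# dividing by the sum of this product over all classes would give the posterior
cond_rgn["Bombing/Explosion", "Middle East & North Africa"] * prior["Bombing/Explosion"]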
Regarding the SVM models, the linear kernel gives better results than the polynomial kernel, but both are worse than Naive Bayes and the Decision Tree.
We were also interested in building a classifier able to predict the region where an event takes place, so we applied the same algorithms described above to this case as well:
# Naive Bayes
NBclassifier=naiveBayes(rgn ~ . , data=train, laplace = 1,na.action = na.omit)
printALL(NBclassifier,train$rgn,test$rgn)
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.357 0.353
# Decision Trees
DTclassifier <- rpart(rgn ~ . , data=train, method = 'class')
rpart.plot(DTclassifier)
printALL(DTclassifier,train$rgn,test$rgn)
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.341 0.347
As the accuracy shows, in this case the algorithms are less able to predict the region where an event happens. Why is this the case? There could be many reasons, but we think we are excluding some variables that are important for characterizing the region of an observation. Nevertheless, we are reasonably satisfied with the result obtained in this case as well.
We also considered using the number of people killed as the variable to predict. However, in many attacks nobody is killed, so we created a binary variable "target" with value 1 if the attack caused deaths and value 0 otherwise. We then use "target" as the variable to predict with the same data mining methods.
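Before fitting the models, one could quickly check how balanced this binary target is; a small sketch (output not shown here):
# Sketch: class balance of the binary target in the training set
table(train$target)
prop.table(table(train$target))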
set.seed(123)
train$target<-as.factor(train$target)
test$target<-as.factor(test$target)
# Naive Bayes
NBclassifier=naiveBayes(target ~ . , data=train, laplace = 1,na.action = na.omit)
printALL(NBclassifier,train$target,test$target)
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.731 0.727
# Decision Trees
DTclassifier <- rpart(target ~ . , data=train, method = 'class')
rpart.plot(DTclassifier)
printALL(DTclassifier,train$target,test$target)
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.734 0.734
#SVM
small_train <- train[sample(1:10000),,drop=FALSE] # random permutation of the first 10,000 rows of train, used as a smaller set to speed up SVM fitting
SVM_Lin_Classifier <- svm(target~., small_train, method="C-classification", kernel="linear")
printALL(SVM_Lin_Classifier,train$target,test$target)
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.704 0.7
SVM_Pol_Classifier <- svm(target~., small_train, method="C-classification", kernel="polynomial")
printALL(SVM_Pol_Classifier,train$target,test$target)
## Accuracy
## trainAccuracy testAccuracy
## [1,] 0.551 0.552
Just as in the first analysis predicting the attack type, Naive Bayes and the Decision Tree give the best results, but the SVM with a linear kernel also performs well and is clearly better than the SVM with a polynomial kernel.
In this section we focus on clustering and on the validation of its results. First we load the required libraries. It is worth noticing that in this section we redefine the working data starting from the original dataset.
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:raster':
##
## shift
library(mltools)
##
## Attaching package: 'mltools'
## The following object is masked from 'package:e1071':
##
## skewness
library(cluster)
library(stats)
library(readr)
library(dplyr)
A quick view of the original data:
setwd("C:/Users/felip/Desktop/Análisis de Datos")
data<- read_csv("globalterrorismdb_0718dist.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## approxdate = col_character(),
## resolution = col_character(),
## country_txt = col_character(),
## region_txt = col_character(),
## provstate = col_character(),
## city = col_character(),
## location = col_character(),
## summary = col_character(),
## alternative_txt = col_character(),
## attacktype1_txt = col_character(),
## attacktype2_txt = col_character(),
## attacktype3 = col_logical(),
## attacktype3_txt = col_logical(),
## targtype1_txt = col_character(),
## targsubtype1_txt = col_character(),
## corp1 = col_character(),
## target1 = col_character(),
## natlty1_txt = col_character(),
## targtype2_txt = col_character(),
## targsubtype2_txt = col_character()
## # ... with 48 more columns
## )
## See spec(...) for full column specifications.
## Warning: 7113 parsing failures.
## row col expected actual file
## 1687 ransomamtus 1/0/T/F/TRUE/FALSE 20000 'globalterrorismdb_0718dist.csv'
## 1687 ransomnote 1/0/T/F/TRUE/FALSE A note, with a sticker of the Black Liberation Army, demanded $20,000 for the safe return of the bartender. However, the perpetrators left the bartender alone and he was able to escape. 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3 1/0/T/F/TRUE/FALSE 2 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3_txt 1/0/T/F/TRUE/FALSE Armed Assault 'globalterrorismdb_0718dist.csv'
## 3432 ransomnote 1/0/T/F/TRUE/FALSE Perpetrators demanded support by the Dutch in the UN for independence for the Moluccan Islands. 'globalterrorismdb_0718dist.csv'
## .... ............... .................. .......................................................................................................................................................................................... ................................
## See problems(...) for more details.
dim(data)
## [1] 181691 135
For the first clustering we keep only the columns corresponding to numerical values. We call this subset "data1":
data1<-data[,c(2,3, 4, 6 ,8,10,16,17,20,21,22,23,24,26,27,28,29,31,33,35,37,41,43,45,49,51,53,57, 66, 68, 69 , 70,72 ,73,75 , 76,78,79 , 81, 82,84,86,88,90,92, 94,96,99, 100,101,103, 104, 105, 106, 110, 111, 112, 114, 117, 121, 123, 125, 131, 132, 133, 134 )] #We select cols matching the criteria
data1<- as.matrix(data1) #transform to matrix
head(data1)
## iyear imonth iday extended country region specificity vicinity crit1
## [1,] 1970 7 2 0 58 2 1 0 1
## [2,] 1970 0 0 0 130 1 1 0 1
## [3,] 1970 1 0 0 160 5 4 0 1
## [4,] 1970 1 0 0 78 8 1 0 1
## [5,] 1970 1 0 0 101 4 1 0 1
## [6,] 1970 1 1 0 217 1 1 0 1
## crit2 crit3 doubtterr alternative multiple success suicide
## [1,] 1 1 0 NA 0 1 0
## [2,] 1 1 0 NA 0 1 0
## [3,] 1 1 0 NA 0 1 0
## [4,] 1 1 0 NA 0 1 0
## [5,] 1 1 -9 NA 0 1 0
## [6,] 1 1 0 NA 0 1 0
## attacktype1 attacktype2 attacktype3 targtype1 targsubtype1 natlty1
## [1,] 1 NA NA 14 68 58
## [2,] 6 NA NA 7 45 21
## [3,] 1 NA NA 10 54 217
## [4,] 3 NA NA 7 46 217
## [5,] 7 NA NA 7 46 217
## [6,] 2 NA NA 3 22 217
## targtype2 targsubtype2 natlty2 targtype3 targsubtype3 natlty3
## [1,] NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA
## [5,] NA NA NA NA NA NA
## [6,] NA NA NA NA NA NA
## guncertain1 guncertain3 individual nperps claimed claimmode claim2
## [1,] 0 NA 0 NA NA NA NA
## [2,] 0 NA 0 7 NA NA NA
## [3,] 0 NA 0 NA NA NA NA
## [4,] 0 NA 0 NA NA NA NA
## [5,] 0 NA 0 NA NA NA NA
## [6,] 0 NA 0 -99 0 NA NA
## claimmode2 claim3 claimmode3 compclaim weaptype1 weapsubtype1
## [1,] NA NA NA NA 13 NA
## [2,] NA NA NA NA 13 NA
## [3,] NA NA NA NA 13 NA
## [4,] NA NA NA NA 6 16
## [5,] NA NA NA NA 8 NA
## [6,] NA NA NA NA 5 5
## weaptype2 weapsubtype2 weaptype3 weapsubtype3 weaptype4 weapsubtype4
## [1,] NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA
## [5,] NA NA NA NA NA NA
## [6,] NA NA NA NA NA NA
## nkill nkillus nkillter nwoundus nwoundte property propextent
## [1,] 1 NA NA NA NA 0 NA
## [2,] 0 NA NA NA NA 0 NA
## [3,] 1 NA NA NA NA 0 NA
## [4,] NA NA NA NA NA 1 NA
## [5,] NA NA NA NA NA 1 NA
## [6,] 0 0 0 0 0 1 3
## ishostkid nhostkid nhostkidus ndays ransom ransompaidus
## [1,] 0 NA NA NA 0 NA
## [2,] 1 1 0 NA 1 NA
## [3,] 0 NA NA NA 0 NA
## [4,] 0 NA NA NA 0 NA
## [5,] 0 NA NA NA 0 NA
## [6,] 0 NA NA NA 0 NA
## hostkidoutcome nreleased INT_LOG INT_IDEO INT_MISC INT_ANY
## [1,] NA NA 0 0 0 0
## [2,] NA NA 0 1 1 1
## [3,] NA NA -9 -9 1 1
## [4,] NA NA -9 -9 1 1
## [5,] NA NA -9 -9 1 1
## [6,] NA NA -9 -9 0 -9
dim(data1) #(checking the length of col)
## [1] 181691 66
Next we analyze the NAs present in data1. This was done manually, and below we report the number of NAs in each column.
Columns of data1 containing NAs (number of NAs in parentheses):
7 (6), 12 (1), 13 (152680), 14 (1), 18 (175377), 19 (181263), 21 (10373), 22 (1559), 23 (170547), 24 (171006), 25 (170863), 26 (180515), 27 (180594), 28 (180544), 29 (380), 30 (181371), 32 (71115), 33 (66120), 34 (162608), 35 (179801), 36 (181075), 37 (181373), 38 (181558), 39 (176852), 41 (20768), 42 (168564), 43 (170149), 44 (179828), 45 (179998), 46 (181618), 47 (181621), 48 (10313), 49 (64446), 50 (66958), 51 (64702), 52 (69143), 54 (117626), 55 (178), 56 (168119), 57 (168174), 58 (173567), 59 (104310), 60 (181139), 61 (170700), 62 (171291)
Given the high number of NAs in several columns, we impose a criterion: we delete any column whose number of NAs is higher than (approximately) 10% of the total number of observations (more than 17,000). Then, in the remaining columns, every NA value is replaced with the mean of the column.
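The same counts and the 10% filter could also be obtained programmatically; this is only an illustrative sketch (the resulting column set should roughly match the manual selection used below, which is what we actually keep):
# Sketch: count NAs per column and keep columns with at most ~10% missing values
na_counts <- colSums(is.na(data1))              # number of NAs in each column of data1
na_counts[na_counts > 0]                        # columns that contain at least one NA
keep <- which(na_counts <= 0.10 * nrow(data1))  # columns passing the 10% criterion
data2_auto <- data1[, keep]                     # illustrative only; data2 below is selected manually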
data2<-data1[,c(1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,20,21,22,29,31,40,48,53,55 )]
data3<-data1[,c(1,2,3,4,5,6,7,8,9,10,11,12,14,15,16,17,20,21,22,29,31,40,48,53,55,63,64,65,66 )]
for(i in 1:ncol(data2)){
data2[is.na(data2[,i]), i] <- mean(data2[,i], na.rm = TRUE)
}
for(i in 1:ncol(data3)){
data3[is.na(data3[,i]), i] <- mean(data3[,i], na.rm = TRUE)
}
#summary(data2)
dim(data2)
## [1] 181691 25
#head(data3)
dim(data3)
## [1] 181691 29
We normalize each column using one of the normalizations presented in class (min-max scaling to the range 0 to 1):
#summary(data2[,1])
data2[,1]=(data2[,1]-min(data2[,1]))/(max(data2[,1])-min(data2[,1]))
min(data2[,1]) #example shown for the first column
## [1] 0
for (i in 1:25){
data2[,i] <-(data2[,i]-min(data2[,i]))/(max(data2[,i])-min(data2[,i]))
}
for (i in 1:29){
data3[,i] <-(data3[,i]-min(data3[,i]))/(max(data3[,i])-min(data3[,i]))
}
#apply the same for each col
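An equivalent, more compact way to write the same column-wise min-max scaling (a sketch that could replace the two loops above):
# Sketch: column-wise min-max scaling with apply (same result as the loops)
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
data2 <- apply(data2, 2, minmax)
data3 <- apply(data3, 2, minmax)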
summary(data2) #safety Check
## iyear imonth iday extended
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.4468 1st Qu.:0.3333 1st Qu.:0.2581 1st Qu.:0.00000
## Median :0.8298 Median :0.5000 Median :0.4839 Median :0.00000
## Mean :0.6944 Mean :0.5389 Mean :0.5002 Mean :0.04535
## 3rd Qu.:0.9362 3rd Qu.:0.7500 3rd Qu.:0.7419 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## country region specificity vicinity
## Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.074 1st Qu.:0.3636 1st Qu.:0.0000 1st Qu.:0.9000
## Median :0.094 Median :0.4545 Median :0.0000 Median :0.9000
## Mean :0.128 Mean :0.5601 Mean :0.1129 Mean :0.9068
## 3rd Qu.:0.156 3rd Qu.:0.8182 3rd Qu.:0.0000 3rd Qu.:0.9000
## Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## crit1 crit2 crit3 doubtterr
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.9000
## Median :1.0000 Median :1.0000 Median :1.0000 Median :0.9000
## Mean :0.9885 Mean :0.9931 Mean :0.8757 Mean :0.8477
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.9000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## multiple success suicide attacktype1
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.00000 1st Qu.:0.1250
## Median :0.0000 Median :1.0000 Median :0.00000 Median :0.2500
## Mean :0.1378 Mean :0.8896 Mean :0.03651 Mean :0.2809
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.2500
## Max. :1.0000 Max. :1.0000 Max. :1.00000 Max. :1.0000
## targtype1 targsubtype1 natlty1 guncertain1
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.09524 1st Qu.:0.1964 1st Qu.:0.0790 1st Qu.:0.00000
## Median :0.14286 Median :0.3393 Median :0.1000 Median :0.00000
## Mean :0.35427 Mean :0.4105 Mean :0.1237 Mean :0.08144
## 3rd Qu.:0.61905 3rd Qu.:0.6429 3rd Qu.:0.1640 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## individual weaptype1 nkill property
## Min. :0.00000 Min. :0.0000 Min. :0.0000000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.3333 1st Qu.:0.0000000 1st Qu.:0.9000
## Median :0.00000 Median :0.4167 Median :0.0006369 Median :1.0000
## Mean :0.00295 Mean :0.4539 Mean :0.0015307 Mean :0.8455
## 3rd Qu.:0.00000 3rd Qu.:0.4167 3rd Qu.:0.0012739 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000000 Max. :1.0000
## ishostkid
## Min. :0.0000
## 1st Qu.:0.9000
## Median :0.9000
## Mean :0.9059
## 3rd Qu.:0.9000
## Max. :1.0000
#dim(data2)
As we can see, we now have the normalized numerical data.
We now proceed to use the k-means algorithm for clustering. We try up to 10 clusters and look for the point ("elbow") at which a larger k no longer reduces the within-cluster sum of squared errors so drastically.
wss <- 0
clust <- 10 # we will plot up to 10 clusters
for (i in 1:clust){
wss[i] <-
sum(kmeans(data3, centers=i, nstart=20)$withinss)
}
plot(1:clust, wss, type="b", xlab="Numero de clusters", ylab="wss")
wss <- 0
clust <- 10 # we will plot up to 10 clusters
for (i in 1:clust){
wss[i] <-
sum(kmeans(data2, centers=i, nstart=20)$withinss)
}
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
plot(1:clust, wss, type="b", xlab="Number of clusters", ylab="wss")
In this case it is clear that, given the criterion, the best choice is k=2 (for data3, our simplest case of study). As shown later on, the critical columns (those contributing the most to the clustering of data3) are removed and the clustering is then applied to data2.
The optimal number of clusters is not so clear for data2, so we check the clustering results for different values of k (2, 3 and 4).
km.out <- kmeans(data2, 3, nstart = 50) #apply k-means with k=3
c=km.out$cluster #save in variable c the target of each row, this will be used later.
km.out$size #Check the size of each cluster
## [1] 81913 22590 77188
km.out <- kmeans(data3, 2, nstart = 50) #apply k-means with k=2
c_29=km.out$cluster #save in variable c the target of each row, this will be used later.
km.out$size #Check the size of each cluster
## [1] 89045 92646
wss_29<-sum(kmeans(data3, 2, nstart=20)$withinss) #Saves the SSE of the data, used later for validation
wss_29 #Check the SSE
## [1] 245529.9
We create a small matrix (1000x29) of random noise to compare its SSE with that of k-means with k=2. We define the function Noisify to generate random numbers in the range [0,1], matching the range of the normalized data. We do the same for a matrix matching the size of the normalized dataset.
Noisify <- function(data) {
if (is.vector(data)) {
noise <- runif(length(data), 0, 1)
noisified <- data + noise
} else {
length <- dim(data)[1] * dim(data)[2]
noise <- matrix(runif(length, 0, 1), dim(data)[1])
noisified <- data + noise
}
return(noisified)
}
a<-matrix(0, nrow = 1000, ncol = 25)
matrix_29<-matrix(0, nrow = 1000, ncol = 29) #This works with the simple case
noise_a<-Noisify(a)
noise_matrix_29<-Noisify(matrix_29)
big_matrix<-matrix(0, nrow =181691 , ncol = 25)
big_noise<-Noisify(big_matrix)
big_matrix_29<-matrix(0, nrow =181691 , ncol = 29)
big_noise_29<-Noisify(big_matrix_29)
Now we compare the SSE over several iterations of random noise against the SSE obtained with the data.
We apply the procedure to the simple case (data3), which we only need to check with k=2.
km_noise_29 <- kmeans(noise_matrix_29, 2, nstart = 50)
#km_noise
wss_n2_29<-sum(kmeans(big_noise_29, 2, nstart=20)$withinss) # we calculate the SSE for the large noise matrix, here with k=2
wss_n2_29
## [1] 427540.8
#We collect the SSE values from the noise and compare them with the SSE from the data.
SSE_of_Noise_and_data3=c(427575.6,427572.7,245529.9,427575.6,427572.7, 427575.6,427572.7,427571.5,427866,427907.1,427725.4)
hist(SSE_of_Noise_and_data3) #Histogram
We repeat the procedure for data2, with k=4.
km_noise <- kmeans(noise_a, 4, nstart = 50)
#km_noise
wss_n2<-sum(kmeans(big_noise, 4, nstart=20)$withinss) # we calculate the SSE for noise several times this time with k=4
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 9084550)
wss_n2
## [1] 355070.2
wss_k_m_4<-sum(kmeans(data2, 4, nstart=20)$withinss) # we calculate the SSE for our data
wss_k_m_4
## [1] 176570.9
#We collect 10 SSE values from the noise and compare them with the SSE from the data.
wss_10=c(427655.5,427600,427706,427611.2, 427998.7,130562,428006.7,427579.1, 427613.7, 427638.4,427616.1)
SSE_k_4=c( 354713.9,355171.7,354772, 355783.7,176570.9,354885.4)
#hist(wss_10) #histogram showing the diff. of SSE of several noise samples and the DS
hist(SSE_k_4) #Histogram of the same procedure with k=4
We can see a notable difference between the SSE for noise and the SSE for the data in both cases, which suggests that the clustering structure is statistically significant.
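As a rough way to quantify this gap (a sketch using the SSE values collected above for data3, not a formal test):
# Sketch: how many standard deviations below the noise SSEs the data SSE lies
noise_sse <- SSE_of_Noise_and_data3[SSE_of_Noise_and_data3 != 245529.9]  # drop the data SSE
(245529.9 - mean(noise_sse)) / sd(noise_sse)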
In this section we compute the incidence and proximity (distance) matrices, and then the correlation between them. This is applied both to the normalized data and to the generated noise. Because of the high computational cost, given the high dimension of the pre-processed data (about 180,000 x 29), we work on a sample (of dimension 1000x29). It is worth noticing that we keep the same proportion of classes obtained with k-means.
a<-matrix(0, nrow = 1000, ncol = 29) #used to repeatedly sample noise
noise_a<-Noisify(a)
km_noise_a <- kmeans(noise_a, 2, nstart = 50) #calculate k means for the noise
a=km_noise_a$cluster #save labels
j=cbind(noise_a,a) # bind data / targets
sorted=j[order(j[,30]),] # order the data given the target
#head(sorted) #Verify
sorted_a=sorted[,-30] # delete the target
b=sorted[,30] # save separately the targets
#head(sorted_a) # Verify
incidence_m<-matrix(0, nrow = 1000, ncol = 1000)
Now we check the correlation between the two obtained matrices.
#Function to create an incidence matrix of 1s and 0s
incidence<-function(incidence_m,a){
for (i in 1:1000){
for (j in 1:1000){
if (a[i]==a[j]){
incidence_m[i,j]<-1
}else{
incidence_m[i,j]<-0
}
}
}
return (incidence_m)
}
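As a side note, the same incidence matrix could be built without the double loop using a vectorized expression; this is only a sketch of an equivalent alternative, not the function used below:
#Sketch: vectorized alternative to incidence() (same result)
incidence_fast <- function(labels) {
  1 * outer(labels, labels, FUN = "==")  # 1 if two rows share the same cluster label, 0 otherwise
}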
incidence_a=incidence(incidence_m,b) #Incidence in noise
m_a <- as.matrix(dist(sorted_a)) #calculate distance matrix
#dim(incidence_a) #Verify dim
cor_a=cor(c(m_a),c(incidence_a)) # calculate correlation between matrices seen as vector.
cor_a
## [1] -0.1446389
#Now we create a vector of correlations for several different random noise samples
var_cor<- matrix(0, nrow = 1, ncol = 1)
for (i in 1:50){
a<-matrix(0, nrow = 1000, ncol = 29) #used to repeatedly sample noise
noise_a<-Noisify(a)
km_noise_a <- kmeans(noise_a, 2, nstart = 50) #calculate k means for the noise
a=km_noise_a$cluster #save labels
j=cbind(noise_a,a) # bind data / targets
sorted=j[order(j[,30]),] # order the data given the target
sorted_a=sorted[,-30] # delete the target
b=sorted[,30] # save separately the targets
incidence_m<-matrix(0, nrow = 1000, ncol = 1000)
incidence_a=incidence(incidence_m,b) #Incidence in noise
m_a <- as.matrix(dist(sorted_a)) #calculate distance matrix
cor_a=cor(c(m_a),c(incidence_a)) # calculate correlation between matrices seen as vector.
#var[i]=cor_a
var_cor=append(var_cor, cor_a)
i=i+1
if (i==50){
break
}
}
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 50000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 50000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 50000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 50000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 50000)
#var_cor
#c(var_cor)
We apply the same procedure to a sample of the actual data.
#First sample the data with the same proportion of classes then order.
#order the data given the targets.
k=cbind(data3,c)
#dim(k)
sorted_data=k[order(k[,30]),] # order according to the cluster target
#head(sorted_data) #Verify
#sorted_data2=sorted_data[,-30] # remove the cluster label
#d=sorted_data[,30] # save the ordered labels as a separate column
#head(d)
#d[90000]
#Sample with k=2
size_clust1=km.out$size[1] #First clust size
size_clust2=km.out$size[2] #Second.
c1_sorted_data=sorted_data[1:size_clust1,] #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2-1),] #data from second
dim(sorted_data)#verify dim
## [1] 181691 30
samples_c1=trunc(size_clust1/(size_clust1+size_clust2)*1000) #we define the first and second cluster sample sizes
samples_c2=trunc(size_clust2/(size_clust1+size_clust2)*1000)+1
sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),]
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),]
#dim(sample_c1)
#dim(sample_c2) #verify
sample_data=rbind(sample_c1,sample_c2) #Combine the samples
d=sample_data[,30]
sample_data2=sample_data[,-30] # drop the cluster label column
dim(sample_data2) #Verify dimensionality
## [1] 1000 29
We now verify the correlation between the matrices.
incidence_data<-matrix(0, nrow = 1000, ncol = 1000) # create matrix
#length(d) Check
incidence_sample_data=incidence(incidence_data,d) #create incidence matrix.
#head(incidence_sample_data)
m_data <- as.matrix(dist(sample_data)) #Create distance matrix (note: sample_data still includes the cluster label column)
m_data[1000,1000] #Verify the diagonal values (must be 0s)
## [1] 0
cor_data=cor(c(m_data),c(incidence_sample_data)) # check correlation value
cor_data
## [1] -0.697568
The values are very different. We present a histogram (including several random noise samples) to check statistical significance:
corr_10=append(-0.7989984,c(var_cor))
hist(corr_10)
We can see a clear difference in the values.
We now proceed with data3, which contains 29 columns of information. Here we only consider the case k=2, following the optimal choice found at the beginning.
km.out <- kmeans(data3, 2, nstart = 50) #apply k-means with k=2
c2=km.out$cluster #save in variable c the target of each row, this will be used later.
km.out$size #Check the size of each cluster
## [1] 89045 92646
#First sample the data with the same proportion of classes then order.
#order the data given the targets.
k2=cbind(data3,c2)
#dim(k)
sorted_data=k2[order(k2[,30]),] # order according to the cluster target
#head(sorted_data) #Verify
#sorted_data2=sorted_data[,-30] # remove the cluster label
#d=sorted_data[,30] # save the ordered labels as a separate column
#head(d)
#d[90000]
#Sample with k=2
size_clust1=km.out$size[1] #First clust size
size_clust2=km.out$size[2] #Second.
c1_sorted_data=sorted_data[1:size_clust1,] #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2-1),] #data from second
dim(sorted_data)#verify dim
## [1] 181691 30
samples_c1=trunc(size_clust1/(size_clust1+size_clust2)*1000) #we define the first and second cluster sample sizes
samples_c2=trunc(size_clust2/(size_clust1+size_clust2)*1000)+1
sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),]
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),]
dim(sample_c1)
## [1] 490 30
dim(sample_c2) #verify
## [1] 510 30
sample_data_k2=rbind(sample_c1,sample_c2) #Combine the samples
d2=sample_data_k2[,30]
sample_data_2=sample_data_k2[,-30]
dim(sample_data_2) #Verify dimensionality
## [1] 1000 29
library(knitr)
promedios_29_c<-matrix(0, nrow = 3, ncol = 29)
#mean(sample_c1[,1])
keys_29=c("iyear","imonth","iday","extended","country","region","specificity","vicinity","crit1","crit2","crit3","doubtterr","multiple","success","suicide","attacktype1","targtype1","targsubtype1","natlty1","guncertain1","individual","weaptype1","nkill","property","ishostkid","INT_LOG","INT_IDEO","INT_MISC","INT_ANY")
for (i in 1:29){
promedios_29_c[1,i]=keys_29[i]
promedios_29_c[2,i]=mean(sample_c1[,i])
promedios_29_c[3,i]=mean(sample_c2[,i])
}
Tabla_29_C <- promedios_29_c
kable(Tabla_29_C)
iyear | imonth | iday | extended | country | region | specificity | vicinity | crit1 | crit2 | crit3 | doubtterr | multiple | success | suicide | attacktype1 | targtype1 | targsubtype1 | natlty1 | guncertain1 | individual | weaptype1 | nkill | property | ishostkid | INT_LOG | INT_IDEO | INT_MISC | INT_ANY |
0.655536257056014 | 0.525 | 0.514878209348255 | 0.0673469387755102 | 0.136608163265306 | 0.541372912801484 | 0.128571428571429 | 0.906326530612245 | 0.995918367346939 | 0.983673469387755 | 0.83469387755102 | 0.913877551020408 | 0.187755102040816 | 0.93469387755102 | 0.0469387755102041 | 0.285204081632653 | 0.340233236151604 | 0.408367370168044 | 0.121067468006455 | 0.155268245259949 | 0 | 0.457823129251701 | 0.00272860331529072 | 0.874489795918367 | 0.908163265306122 | 0.908367346938776 | 0.924897959183673 | 0.91265306122449 | 0.931836734693878 |
0.715978306216103 | 0.538888888888889 | 0.500948766603416 | 0.0313725490196078 | 0.121778431372549 | 0.568805704099822 | 0.0872549019607843 | 0.907647058823529 | 0.968627450980392 | 0.994117647058824 | 0.909803921568627 | 0.775882352941176 | 0.0705882352941176 | 0.872549019607843 | 0.0215686274509804 | 0.268137254901961 | 0.39094304388422 | 0.425855873423162 | 0.116484546365345 | 0.0137254901960784 | 0.00980392156862745 | 0.447385620915033 | 0.00126599042733592 | 0.859803921568628 | 0.905294117647059 | 0 | 0 | 0.906666666666667 | 0.101960784313725 |
We can see a notable difference in the means of the columns INT_LOG, INT_IDEO and INT_ANY between the clusters, so the clusters are mainly driven by the differences in these columns. Given this, we wanted to perform a sensitivity analysis: remove these columns (this is why we created data2, which excludes them) and check the clusters k-means forms without them. We now proceed to check the cluster means for k = 2, 3, 4 (working on data2, which does not contain the INT variables).
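For reference, data2 was built earlier in the document by removing the INT_* columns. A minimal sketch of that step, assuming data3 carries the column names listed above, would be:
# Sketch (assumption about how data2 was derived): drop the INT_* columns for the sensitivity analysis
int_cols <- c("INT_LOG", "INT_IDEO", "INT_MISC", "INT_ANY")
data2_sketch <- data3[, !(colnames(data3) %in% int_cols)]
#dim(data2_sketch) # should have 25 columns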
#For bigger Samples
#Sample with k=2
size_clust1=km.out$size[1] #First clust size
size_clust2=km.out$size[2] #Second.
c1_sorted_data=sorted_data[1:size_clust1,] #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2),] #data from the second cluster
dim(sorted_data)#verify dim
## [1] 181691 30
samples_c1=trunc(size_clust1/(size_clust1+size_clust2)*5000) #we define the sample size for each cluster as before, but now with a total sample size of 5,000
samples_c2=trunc(size_clust2/(size_clust1+size_clust2)*5000)+1
sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),]
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),]
#dim(sample_c1)
#dim(sample_c2) #verify
sample_data=rbind(sample_c1,sample_c2) #combine the samples
sample_data<-as.data.frame(sample_data)
dim(sample_data) #Check dimensionality
## [1] 5000 30
km.out <- kmeans(data2, 2, nstart = 50) #apply k-means with k=2
c2=km.out$cluster #save in variable c2 the cluster assignment of each row; this will be used later.
km.out$size #Check the size of each cluster
## [1] 103058 78633
#First sample the data with the same proportion of classes then order.
#order the data given the targets.
k2=cbind(data2,c2)
#dim(k)
sorted_data=k2[order(k2[,26]),] # order according to the cluster target
#head(sorted_data) #Verify
#sorted_data2=sorted_data[,-30] # the cluster label is removed
#d=sorted_data[,30] # the ordered labels are stored in a separate column
#head(d)
#d[90000]
#Sample with k=2
size_clust1=km.out$size[1] #First clust size
size_clust2=km.out$size[2] #Second.
c1_sorted_data=sorted_data[1:size_clust1,] #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2),] #data from the second cluster
dim(sorted_data)#verify dim
## [1] 181691 26
samples_c1=trunc(size_clust1/(size_clust1+size_clust2)*1000) #we define the sample size for each cluster, proportional to its share
samples_c2=trunc(size_clust2/(size_clust1+size_clust2)*1000)+1
sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),]
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),]
dim(sample_c1)
## [1] 567 26
dim(sample_c2) #verify
## [1] 433 26
sample_data_k2=rbind(sample_c1,sample_c2) #combine the samples
d2=sample_data_k2[,26]
sample_data_2=sample_data_k2[,-26]
dim(sample_data_2) #Verify dimensionality
## [1] 1000 25
library(knitr)
promedios_1<-matrix(0, nrow = 3, ncol = 25)
#mean(sample_c1[,1])
keys_25=c("iyear","imonth","iday","extended","country","region","specificity","vicinity","crit1","crit2","crit3","doubtterr","multiple","success","suicide","attacktype1","targtype1","targsubtype1","natlty1","guncertain1","individual","weaptype1","nkill","property","ishostkid")
for (i in 1:25){
promedios_1[1,i]=keys_25[i]
promedios_1[2,i]=mean(sample_c1[,i])
promedios_1[3,i]=mean(sample_c2[,i])
}
Tabla <- promedios_1
kable(Tabla)
iyear | imonth | iday | extended | country | region | specificity | vicinity | crit1 | crit2 | crit3 | doubtterr | multiple | success | suicide | attacktype1 | targtype1 | targsubtype1 | natlty1 | guncertain1 | individual | weaptype1 | nkill | property | ishostkid |
0.654283462794101 | 0.536743092298648 | 0.483814075211925 | 0.0229276895943563 | 0.129968253968254 | 0.535674202340869 | 0.107583774250441 | 0.908289241622575 | 1 | 0.998236331569665 | 0.798941798941799 | 0.831922398589065 | 0.0864197530864197 | 0.894179894179894 | 0.0352733686067019 | 0.290564373897707 | 0.0980095742000504 | 0.209616505004635 | 0.123239858906526 | 0.0740740740740741 | 0.0017636684303351 | 0.462815990593768 | 0.00161977186168926 | 0.837918871252205 | 0.896483078240124 |
0.733870571470689 | 0.553117782909931 | 0.514639052372793 | 0.0600461893764434 | 0.133644341801386 | 0.610959479319756 | 0.109699769053118 | 0.908083140877598 | 0.976905311778291 | 0.986143187066975 | 1 | 0.851732101616628 | 0.131639722863741 | 0.886836027713626 | 0.0300230946882217 | 0.288394919168591 | 0.70130869899923 | 0.683929199070698 | 0.120040396139572 | 0.0900692840646651 | 0 | 0.454003079291763 | 0.00163964315350063 | 0.828637413394919 | 0.90554272517321 |
The most notable differences are in targtype1 and targsubtype1.
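As an aside, the per-cluster means above could be computed more compactly with colMeans; a sketch using the objects already in scope (column 26 is the cluster label):
# Sketch: idiomatic per-cluster means over the 25 feature columns
means_c1 <- colMeans(sample_c1[, 1:25])
means_c2 <- colMeans(sample_c2[, 1:25])
#kable(rbind(cluster1 = means_c1, cluster2 = means_c2), digits = 3)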
We apply the same procedure with k = 3.
km.out <- kmeans(data2, 3, nstart = 50) #apply k-means with k=3
c3=km.out$cluster #save in variable c3 the cluster assignment of each row; this will be used later.
km.out$size #Check the size of each cluster
## [1] 81913 77188 22590
#First sample the data with the same proportion of classes then order.
#order the data given the targets.
k3=cbind(data2,c3)
#dim(k)
sorted_data=k3[order(k3[,26]),] # order according to the cluster target
#head(sorted_data) #Verify
#sorted_data2=sorted_data[,-30] # the cluster label is removed
#d=sorted_data[,30] # the ordered labels are stored in a separate column
#head(d)
#d[90000]
#Sample with k=3
size_clust1=km.out$size[1] #First clust size
size_clust2=km.out$size[2] #Second.
size_clust3=km.out$size[3] #Third
c1_sorted_data=sorted_data[1:size_clust1,] #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2),] #data from the second cluster
c3_sorted_data=sorted_data[(size_clust1+size_clust2+1):(size_clust1+size_clust2+size_clust3),] #data from the third cluster
dim(sorted_data)#verify dim
## [1] 181691 26
samples_c1=trunc(size_clust1/(size_clust1+size_clust2+size_clust3)*1000) #we define the reduced sample size for each cluster
samples_c2=trunc(size_clust2/(size_clust1+size_clust2+size_clust3)*1000)+1
samples_c3=trunc(size_clust3/(size_clust1+size_clust2+size_clust3)*1000)+1
samples_c1+samples_c2+samples_c3 #Verify sum=1000
## [1] 1000
sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),]
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),]
sample_c3=c3_sorted_data[sample(nrow(c3_sorted_data), samples_c3,replace = FALSE),]
#dim(sample_c1)
#dim(sample_c2) #verify
sample_data_k3=rbind(sample_c1,sample_c2,sample_c3) #combine the samples
d3=sample_data_k3[,26]
sample_data_3=sample_data_k3[,-26]
dim(sample_data_3) #Verify dimensionality
## [1] 1000 25
promedios_2<-matrix(0, nrow = 4, ncol = 25)
#mean(sample_c1[,1])
keys_25=c("iyear","imonth","iday","extended","country","region","specificity","vicinity","crit1","crit2","crit3","doubtterr","multiple","success","suicide","attacktype1","targtype1","targsubtype1","natlty1","guncertain1","individual","weaptype1","nkill","property","ishostkid")
for (i in 1:25){
promedios_2[1,i]=keys_25[i]
promedios_2[2,i]=mean(sample_c1[,i])
promedios_2[3,i]=mean(sample_c2[,i])
promedios_2[4,i]=mean(sample_c3[,i])
}
Tabla2 <- promedios_2
kable(Tabla2)
iyear | imonth | iday | extended | country | region | specificity | vicinity | crit1 | crit2 | crit3 | doubtterr | multiple | success | suicide | attacktype1 | targtype1 | targsubtype1 | natlty1 | guncertain1 | individual | weaptype1 | nkill | property | ishostkid |
0.65806146572104 | 0.546481481481481 | 0.508888888888889 | 0.0333333333333333 | 0.133968888888889 | 0.516161616161616 | 0.0844444444444444 | 0.906222222222222 | 0.991111111111111 | 0.997777777777778 | 1 | 0.824888888888889 | 0.117777777777778 | 0.893333333333333 | 0.0311111111111111 | 0.279722222222222 | 0.095026455026455 | 0.199355457533998 | 0.12306596986901 | 0.0577777777777778 | 0.00222222222222222 | 0.457777777777778 | 0.00110214037321241 | 0.878222222222222 | 0.902248468276224 |
0.742127659574468 | 0.512745098039216 | 0.504667931688805 | 0.0658823529411765 | 0.123632941176471 | 0.59379679144385 | 0.127647058823529 | 0.905176470588235 | 0.976470588235294 | 0.985882352941176 | 1 | 0.825411764705882 | 0.141176470588235 | 0.891764705882353 | 0.0282352941176471 | 0.291764705882353 | 0.693557422969188 | 0.669300164046987 | 0.118672622142455 | 0.0756773655405872 | 0.00470588235294118 | 0.468235294117647 | 0.00167983812749427 | 0.833647058823529 | 0.908941176470588 |
0.730893617021277 | 0.546 | 0.507612903225806 | 0 | 0.13728 | 0.576 | 0.12 | 0.9112 | 1 | 1 | 0 | 0.9992 | 0.096 | 0.912 | 0.04 | 0.285 | 0.142857142857143 | 0.2705693910072 | 0.13572 | 0.0246515214189983 | 0.008 | 0.456 | 0.00214851369264336 | 0.8376 | 0.9008 |
Cluster 3 has small values for extended, crit3 and targtype1. We again see different values across the clusters for targtype1 and targsubtype1.
We apply the same procedure with k = 4.
km.out <- kmeans(data2, 4, nstart = 50) #apply k-means with k=4
c4=km.out$cluster #save in variable c4 the target of each row, this will be used later.
#First sample the data with the same proportion of classes then order.
#order the data given the targets.
k4=cbind(data2,c4)
#dim(k)
sorted_data=k4[order(k4[,26]),] # order according to the cluster target
#head(sorted_data) #Verify
#sorted_data2=sorted_data[,-30] # the cluster label is removed
#d=sorted_data[,30] # the ordered labels are stored in a separate column
#head(d)
#d[90000]
#Sample with k=4
size_clust1=km.out$size[1] #First clust size
size_clust2=km.out$size[2] #Second.
size_clust3=km.out$size[3] #Third
size_clust4=km.out$size[4] #Fourth.
c1_sorted_data=sorted_data[1:size_clust1,] #Data from the first clust
c2_sorted_data=sorted_data[(size_clust1+1):(size_clust1+size_clust2),] #data from the second cluster
c3_sorted_data=sorted_data[(size_clust1+size_clust2+1):(size_clust1+size_clust2+size_clust3),] #data from the third cluster
c4_sorted_data=sorted_data[(size_clust1+size_clust2+size_clust3+1):(size_clust1+size_clust2+size_clust3+size_clust4),] #data from the fourth cluster
dim(sorted_data)#verify dim
## [1] 181691 26
samples_c1=trunc(size_clust1/(size_clust1+size_clust2+size_clust3+size_clust4)*1000) #we define the reduced sample size for each cluster
samples_c2=trunc(size_clust2/(size_clust1+size_clust2+size_clust3+size_clust4)*1000)+1
samples_c3=trunc(size_clust3/(size_clust1+size_clust2+size_clust3+size_clust4)*1000)
samples_c4=trunc(size_clust4/(size_clust1+size_clust2+size_clust3+size_clust4)*1000)+1
samples_c1+samples_c2+samples_c3+samples_c4 #Verify sum=1000
## [1] 1000
sample_c1=c1_sorted_data[sample(nrow(c1_sorted_data), samples_c1,replace = FALSE),]
sample_c2=c2_sorted_data[sample(nrow(c2_sorted_data), samples_c2,replace = FALSE),]
sample_c3=c3_sorted_data[sample(nrow(c3_sorted_data), samples_c3,replace = FALSE),]
sample_c4=c4_sorted_data[sample(nrow(c4_sorted_data), samples_c4,replace = FALSE),]
#dim(sample_c1)
#dim(sample_c2) #verify
sample_data_k4=rbind(sample_c1,sample_c2,sample_c3,sample_c4) #combine the samples
d4=sample_data_k4[,26]
sample_data_4=sample_data_k4[,-26]
dim(sample_data_4) #Verify dimensionality
## [1] 1000 25
promedios_3<-matrix(0, nrow = 5, ncol = 25)
#mean(sample_c1[,1])
keys_25=c("iyear","imonth","iday","extended","country","region","specificity","vicinity","crit1","crit2","crit3","doubtterr","multiple","success","suicide","attacktype1","targtype1","targsubtype1","natlty1","guncertain1","individual","weaptype1","nkill","property","ishostkid")
for (i in 1:25){
promedios_3[1,i]=keys_25[i]
promedios_3[2,i]=mean(sample_c1[,i])
promedios_3[3,i]=mean(sample_c2[,i])
promedios_3[4,i]=mean(sample_c3[,i])
promedios_3[5,i]=mean(sample_c4[,i])
}
Tabla3 <- promedios_3
kable(Tabla3)
iyear | imonth | iday | extended | country | region | specificity | vicinity | crit1 | crit2 | crit3 | doubtterr | multiple | success | suicide | attacktype1 | targtype1 | targsubtype1 | natlty1 | guncertain1 | individual | weaptype1 | nkill | property | ishostkid |
0.721841567496093 | 0.556261770244821 | 0.490978676872608 | 0.0480225988700565 | 0.12869209039548 | 0.567282999486389 | 0.118962889610425 | 0.908757062146893 | 0.985875706214689 | 0.985875706214689 | 1 | 0.824576271186441 | 0 | 0.827683615819209 | 0.0310734463276836 | 0.278954802259887 | 0.696529459241324 | 0.663934821646403 | 0.114831896972979 | 0.0743665556765513 | 0.00564971751412429 | 0.461158192090395 | 0.00160794230392344 | 0.853389830508475 | 0.907942968147742 |
0.644208037825059 | 0.532828282828283 | 0.504398826979472 | 0.047979797979798 | 0.123280303030303 | 0.538337924701561 | 0.106691919191919 | 0.907575757575758 | 0.98989898989899 | 0.997474747474748 | 1 | 0.819191919191919 | 0 | 0.911616161616162 | 0.0353535353535354 | 0.262626262626263 | 0.0983645983645984 | 0.207510711122689 | 0.126468905154178 | 0.0757575757575758 | 0 | 0.437079124579125 | 0.00115815654069343 | 0.860858585858586 | 0.906313131313131 |
0.692004118050789 | 0.50739247311828 | 0.491155046826223 | 0.0161290322580645 | 0.135379032258065 | 0.565982404692082 | 0.189516129032258 | 0.912903225806452 | 1 | 1 | 0 | 1 | 0.0725806451612903 | 0.919354838709677 | 0.0483870967741935 | 0.282258064516129 | 0.142857142857143 | 0.272029302145852 | 0.136975806451613 | 0.024850324011087 | 0 | 0.446236559139785 | 0.00312639106318708 | 0.818548387096774 | 0.904032258064516 |
0.754643701452212 | 0.541666666666667 | 0.516385048643113 | 0.0238095238095238 | 0.13152380952381 | 0.542568542568543 | 0.136904761904762 | 0.906349206349206 | 1 | 0.992063492063492 | 1 | 0.865079365079365 | 1 | 0.920634920634921 | 0.0396825396825397 | 0.334325396825397 | 0.473922902494331 | 0.506005050092627 | 0.126360102238958 | 0.0641384141061491 | 0 | 0.462301587301587 | 0.000666295342216777 | 0.864285714285714 | 0.898412698412698 |
Visualization of the two clusters based on a comparison of their features:
We see some differences between the clusters, but only two variables (INT_LOG and INT_IDEO) drive the separation of the points into the two clusters.
We will now analyze the clustering with the silhouette score method, which requires a sample of the original data.
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
set.seed(123)
sample_data2=rbind(sample_c1,sample_c2)
sample_data2<-as.data.frame(sample_data2)
fviz_nbclust(sample_data2, kmeans, method='silhouette')
The optimal number of clusters is 2. Now we look at the cluster assignment with 2 clusters on a 5,000-row sample of the data, keeping the same class proportions as earlier in the code.
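The assignment plot itself is not reproduced here; a hedged sketch of how it could be drawn with factoextra on the 5,000-row sample built above (column 30 being the cluster label, and the re-fit being illustrative) is:
# Sketch: cluster-assignment plot on the 5,000-row proportional sample
km_sample <- kmeans(sample_data[, -30], centers = 2, nstart = 25) # re-fit on the sample without the label column
fviz_cluster(km_sample, data = sample_data[, -30], geom = "point")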
In the last table we can see differences between the two clusters. From this analysis we conclude that, under the silhouette score, the two clusters show no clearly distinguishing characteristics: the fitted model essentially discriminates on only two variables ("INT_LOG" and "INT_IDEO"), finding that they give the greatest separation in the data and thus determine the resulting clusters.
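As a cross-check on the silhouette result, the average silhouette width for several values of k can also be computed directly with the cluster package (a sketch, assuming the cluster package is available; variable names are illustrative):
library(cluster)
d_mat <- dist(sample_data2) # distance matrix for the 1,000-row sample
avg_sil <- sapply(2:5, function(k) {
  km <- kmeans(sample_data2, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d_mat)[, "sil_width"]) # average silhouette width for this k
})
avg_sil # compare the average widths across k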
After exploring the classification and clustering methods and learning a lot about our data, the last thing we are going to do is study whether certain patterns exist. For this we will use the Apriori algorithm. We are working with the usual data:
library(readr)
library(dplyr)
set.seed(123)
setwd("C:/Users/felip/Desktop/Análisis de Datos")
data <- read_csv("globalterrorismdb_0718dist.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## approxdate = col_character(),
## resolution = col_character(),
## country_txt = col_character(),
## region_txt = col_character(),
## provstate = col_character(),
## city = col_character(),
## location = col_character(),
## summary = col_character(),
## alternative_txt = col_character(),
## attacktype1_txt = col_character(),
## attacktype2_txt = col_character(),
## attacktype3 = col_logical(),
## attacktype3_txt = col_logical(),
## targtype1_txt = col_character(),
## targsubtype1_txt = col_character(),
## corp1 = col_character(),
## target1 = col_character(),
## natlty1_txt = col_character(),
## targtype2_txt = col_character(),
## targsubtype2_txt = col_character()
## # ... with 48 more columns
## )
## See spec(...) for full column specifications.
## Warning: 7113 parsing failures.
## row col expected actual file
## 1687 ransomamtus 1/0/T/F/TRUE/FALSE 20000 'globalterrorismdb_0718dist.csv'
## 1687 ransomnote 1/0/T/F/TRUE/FALSE A note, with a sticker of the Black Liberation Army, demanded $20,000 for the safe return of the bartender. However, the perpetrators left the bartender alone and he was able to escape. 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3 1/0/T/F/TRUE/FALSE 2 'globalterrorismdb_0718dist.csv'
## 1771 attacktype3_txt 1/0/T/F/TRUE/FALSE Armed Assault 'globalterrorismdb_0718dist.csv'
## 3432 ransomnote 1/0/T/F/TRUE/FALSE Perpetrators demanded support by the Dutch in the UN for independence for the Moluccan Islands. 'globalterrorismdb_0718dist.csv'
## .... ............... .................. .......................................................................................................................................................................................... ................................
## See problems(...) for more details.
dim(data)
## [1] 181691 135
# doubtterr is a binary variable describing the doubts about the "reliability" of the event
df <- data %>% filter(doubtterr == 0 & nkill >= 0)
rm(data)
df <- df %>%
dplyr::select(
# Time
iyear, # Year
imonth, # Month
iday, # Day
# Geospatial
city, # City
latitude, # Geo coordinate
longitude, # Geo coordinate
country, # Country id
region, # Region id, see docs for details
# Numerical
nperps, # Number of perpetrators
nkill, # Death toll
nwound, # Number of wounded
nkillter, # Death toll - terrorist only
# Binary
vicinity, # Event occurred in a city or next to it, binary
ishostkid, # Any hostages ?
ransom, # Any ransom demanded ?
extended, # Event duration above 24hrs
# Categorical
country_txt, # Desc country
region_txt, # Desc region
attacktype1_txt, # Desc attack type
weaptype1_txt, # Desc weapon type
targtype1_txt, # Desc target type
propextent, # Amount of damage done
# Text
summary, # Motive text variable
gname) # Name of organization
long_name <- "Vehicle (not to include vehicle-borne explosives, i.e., car or truck bombs)"
df$weaptype1_txt <- as.character(df$weaptype1_txt)
df$weaptype1_txt[df$weaptype1_txt == long_name] <- "Vehicle"
df$weaptype1_txt <- as.factor(df$weaptype1_txt)
rm(long_name)
long_name <- "Explosives/Bombs/Dynamite"
df$weaptype1_txt <- as.character(df$weaptype1_txt)
df$weaptype1_txt[df$weaptype1_txt == long_name] <- "Explosives"
df$weaptype1_txt <- as.factor(df$weaptype1_txt)
rm(long_name)
### More reduced data
ml_df <- df %>%
# Initial filter
dplyr::filter(nkill >= 0 &
nwound >= 0 &
(longitude >= -180 & longitude <= 200) &
(latitude >= -90 & latitude <= 90) &
(imonth >= 1 & imonth <= 12)) %>%
# Remove outliers
dplyr::filter( !(abs(nkill-mean(nkill))/sd(nkill) > 4)) %>%
dplyr::filter( !(abs(nwound-mean(nwound))/sd(nwound) > 4)) %>%
# Clean dataset
dplyr::mutate(
# Fill with median
#nperps = ifelse(is.na(nperps) | nperps < 1, perp_median, nperps),
# Check binary variables
vicinity = ifelse((vicinity != 0 & vicinity != 1) | is.na(vicinity), 0, vicinity),
extended = ifelse((extended != 0 & extended != 1) | is.na(extended), 0, extended),
ransom = ifelse((ransom != 0 & ransom != 1) | is.na(ransom), 0, ransom),
ishostkid = ifelse((ishostkid != 0 & ishostkid != 1) | is.na(ishostkid), 0, ishostkid),
propextent = ifelse( (propextent < 1 | propextent > 4) | is.na(propextent), 4, propextent)) %>%
# Feature engineering
dplyr::mutate(
# Group known
gknow = ifelse(gname != "Unknown", 1, 0),
# Scale coordinates
lat = as.vector(scale(latitude)),
long = as.vector(scale(longitude))) %>%
# Create binary target
dplyr::mutate(target = ifelse(round(nkill, 0) > 0, 1, 0)) %>%
# Recode vars, scale numericals, convert cat's to factors.
dplyr::mutate(month = factor(imonth),
cnty = factor(country_txt),
rgn = factor(region_txt),
vnity = factor(vicinity),
ext = factor(extended),
rans = factor(ransom),
host = factor(ishostkid),
atype = factor(attacktype1_txt),
wtype = factor(weaptype1_txt),
ttype = factor(targtype1_txt),
gknow = factor(gknow),
gname = factor(gname),
prop = factor(propextent),
trgt = factor(target),
lat = as.vector(scale(latitude)),
long = as.vector(scale(longitude))) %>%
# Select clean data
dplyr::select(rgn, wtype, atype, ttype, trgt, gname)
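Before converting to transactions, it is worth a quick structural check that the selected columns are all factors, since that is what the coercion to "transactions" expects (illustrative):
str(ml_df) # rgn, wtype, atype, ttype, trgt and gname should all be factors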
In order to work with the algorithm in R we need to convert the data into a "transactions" object:
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
## Loading required package: grid
tData <- as(ml_df, "transactions")
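As a quick sanity check on the conversion, one can summarize the transaction object and inspect a couple of transactions (illustrative):
summary(tData) # number of transactions and items, density, most frequent items
inspect(tData[1:2]) # item sets of the first two transactions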
Now we are ready to explore the world of association rules. We will begin the analysis with a brief look at the frequency of the item sets:
frequentItems <- eclat(tData, parameter = list(supp = 0.5, maxlen = 5)) # calculates support for frequent items
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.5 1 5 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 62076
##
## create itemset ...
## set transactions ...[2957 item(s), 124152 transaction(s)] done [0.16s].
## sorting and recoding items ... [3 item(s)] done [0.02s].
## creating bit matrix ... [3 row(s), 124152 column(s)] done [0.00s].
## writing ... [4 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequentItems)
## items support count
## [1] {wtype=Explosives,atype=Bombing/Explosion} 0.5426493 67371
## [2] {wtype=Explosives} 0.5748598 71370
## [3] {trgt=0} 0.5512275 68436
## [4] {atype=Bombing/Explosion} 0.5500677 68292
itemFrequencyPlot(tData, topN=10, type="absolute", main="Item Frequency") # plot frequent items
From the histogram it is possible to see the frequency of the itemsets, which will be particularly useful in the later analysis. So what comes next? Let's see what results the first analysis gives:
rules <- apriori (tData, parameter = list(supp = 0.1, conf = 0.5, minlen = 2, maxlen=5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 5 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 12415
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2957 item(s), 124152 transaction(s)] done [0.15s].
## sorting and recoding items ... [15 item(s)] done [0.00s].
## creating transaction tree ... done [0.07s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [78 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules
## lhs rhs support confidence lift count
## [1] {rgn=Middle East & North Africa,
## atype=Bombing/Explosion,
## gname=Unknown} => {wtype=Explosives} 0.1366551 0.9970616 1.734443 16966
## [2] {atype=Bombing/Explosion,
## trgt=1,
## gname=Unknown} => {wtype=Explosives} 0.1075698 0.9950823 1.731000 13355
## [3] {rgn=Middle East & North Africa,
## atype=Bombing/Explosion,
## trgt=1} => {wtype=Explosives} 0.1019637 0.9945789 1.730124 12659
## [4] {rgn=Middle East & North Africa,
## atype=Bombing/Explosion} => {wtype=Explosives} 0.1951318 0.9945400 1.730056 24226
## [5] {rgn=South Asia,
## atype=Bombing/Explosion} => {wtype=Explosives} 0.1437351 0.9938182 1.728801 17845
## [6] {atype=Bombing/Explosion,
## gname=Unknown} => {wtype=Explosives} 0.2989883 0.9935494 1.728333 37120
rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules
## lhs rhs support confidence lift count
## [1] {atype=Armed Assault,
## trgt=1} => {wtype=Firearms} 0.1277225 0.9152141 3.214123 15857
## [2] {wtype=Firearms,
## trgt=1} => {atype=Armed Assault} 0.1277225 0.6245372 3.197952 15857
## [3] {wtype=Firearms} => {atype=Armed Assault} 0.1702188 0.5977880 3.060982 21133
## [4] {atype=Armed Assault} => {wtype=Firearms} 0.1702188 0.8716077 3.060982 21133
## [5] {wtype=Explosives,
## atype=Bombing/Explosion,
## trgt=1} => {rgn=Middle East & North Africa} 0.1019637 0.5498653 1.894829 12659
## [6] {atype=Bombing/Explosion,
## trgt=1} => {rgn=Middle East & North Africa} 0.1025195 0.5459381 1.881295 12728
At first glance it is immediately clear that the attack type "Bombing/Explosion" (and consequently the weapon type "Explosives") dominates the rule patterns, so in the next analysis we will drop the observations corresponding to these attacks.
It is also useful to look at redundant rules. These are rules that only add items to a simpler rule already in the set, and can thus be dropped from the rules returned by the algorithm.
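As an alternative to the subset-matrix approach below, arules also provides is.redundant(), which flags rules that do not improve on a more general rule; a sketch:
# Sketch: prune redundant rules with the built-in helper instead of is.subset()
rules_pruned <- rules[!is.redundant(rules)]
length(rules_pruned)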
#dropping redundant rules
subsetRules <- which(colSums(is.subset(rules, rules)) > 1) # get subset rules in vector
length(subsetRules)
## [1] 65
rules <- rules[-subsetRules] # remove subset rules.
rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules
## lhs rhs support confidence lift count
## [1] {wtype=Firearms} => {trgt=1} 0.2045074 0.7182055 1.600378 25390
## [2] {atype=Armed Assault} => {trgt=1} 0.1395547 0.7145921 1.592326 17326
## [3] {rgn=Middle East & North Africa} => {wtype=Explosives} 0.2045477 0.7048684 1.226157 25395
## [4] {rgn=Middle East & North Africa} => {atype=Bombing/Explosion} 0.1962030 0.6761130 1.229145 24359
## [5] {rgn=Middle East & North Africa} => {gname=Unknown} 0.1848379 0.6369490 1.361521 22948
## [6] {ttype=Private Citizens & Property} => {trgt=1} 0.1511695 0.5874914 1.309107 18768
rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules
## lhs rhs support confidence lift count
## [1] {wtype=Firearms} => {trgt=1} 0.2045074 0.7182055 1.600378 25390
## [2] {atype=Armed Assault} => {trgt=1} 0.1395547 0.7145921 1.592326 17326
## [3] {rgn=Middle East & North Africa} => {gname=Unknown} 0.1848379 0.6369490 1.361521 22948
## [4] {ttype=Private Citizens & Property} => {trgt=1} 0.1511695 0.5874914 1.309107 18768
## [5] {rgn=Middle East & North Africa} => {trgt=1} 0.1627682 0.5608971 1.249847 20208
## [6] {rgn=Middle East & North Africa} => {atype=Bombing/Explosion} 0.1962030 0.6761130 1.229145 24359
Dropping the redundant rules seems to reduce, to some extent, the dependence on the event "Bombing/Explosion". From the lists above it is possible to see the highest-confidence rules and the ones with the highest lift.
Let's have a look at what happens if we drop that event.
We are going to investigate the rules without that kind of event:
tData_2 <- as(ml_df[which(ml_df$atype!="Bombing/Explosion"),], "transactions")
frequentItems <- eclat(tData_2, parameter = list(supp = 0.5, maxlen = 5)) # calculates support for frequent items
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.5 1 5 frequent itemsets FALSE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 27930
##
## create itemset ...
## set transactions ...[2113 item(s), 55860 transaction(s)] done [0.07s].
## sorting and recoding items ... [2 item(s)] done [0.02s].
## creating bit matrix ... [2 row(s), 55860 column(s)] done [0.00s].
## writing ... [2 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequentItems)
## items support count
## [1] {wtype=Firearms} 0.6246867 34895
## [2] {trgt=1} 0.5800573 32402
itemFrequencyPlot(tData_2, topN=10, type="absolute", main="Item Frequency") # plot frequent items
rules <- apriori (tData_2, parameter = list(supp = 0.1, conf = 0.5, minlen = 2, maxlen=5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 2
## maxlen target ext
## 5 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 5586
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[2113 item(s), 55860 transaction(s)] done [0.07s].
## sorting and recoding items ... [17 item(s)] done [0.00s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [63 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules
## lhs rhs support confidence lift count
## [1] {wtype=Incendiary,
## atype=Facility/Infrastructure Attack} => {trgt=0} 0.1011636 0.9701288 2.310145 5651
## [2] {atype=Facility/Infrastructure Attack} => {trgt=0} 0.1253491 0.9521349 2.267297 7002
## [3] {wtype=Incendiary} => {trgt=0} 0.1202649 0.9475317 2.256336 6718
## [4] {atype=Armed Assault,
## trgt=1,
## gname=Unknown} => {wtype=Firearms} 0.1230755 0.9464482 1.515077 6875
## [5] {rgn=South Asia,
## atype=Armed Assault,
## trgt=1} => {wtype=Firearms} 0.1028285 0.9228792 1.477347 5744
## [6] {atype=Armed Assault,
## trgt=1} => {wtype=Firearms} 0.2838704 0.9152141 1.465077 15857
rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules
## lhs rhs support confidence lift count
## [1] {wtype=Incendiary,
## trgt=0} => {atype=Facility/Infrastructure Attack} 0.1011636 0.8411730 6.389437 5651
## [2] {atype=Facility/Infrastructure Attack,
## trgt=0} => {wtype=Incendiary} 0.1011636 0.8070551 6.358547 5651
## [3] {wtype=Incendiary} => {atype=Facility/Infrastructure Attack} 0.1042786 0.8215797 6.240609 5825
## [4] {atype=Facility/Infrastructure Attack} => {wtype=Incendiary} 0.1042786 0.7920859 6.240609 5825
## [5] {wtype=Incendiary,
## atype=Facility/Infrastructure Attack} => {trgt=0} 0.1011636 0.9701288 2.310145 5651
## [6] {atype=Facility/Infrastructure Attack} => {trgt=0} 0.1253491 0.9521349 2.267297 7002
#dropping redundant rules
subsetRules <- which(colSums(is.subset(rules, rules)) > 1) # get subset rules in vector
length(subsetRules)
## [1] 44
rules <- rules[-subsetRules] # remove subset rules.
rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules
## lhs rhs support
## [1] {atype=Facility/Infrastructure Attack} => {trgt=0} 0.1253491
## [2] {wtype=Incendiary} => {trgt=0} 0.1202649
## [3] {atype=Assassination} => {trgt=1} 0.1754565
## [4] {ttype=Police} => {wtype=Firearms} 0.1437701
## [5] {ttype=Police} => {trgt=1} 0.1384891
## [6] {atype=Assassination} => {wtype=Firearms} 0.1650555
## confidence lift count
## [1] 0.9521349 2.267297 7002
## [2] 0.9475317 2.256336 6718
## [3] 0.7378604 1.272048 9801
## [4] 0.7332238 1.173746 8031
## [5] 0.7062905 1.217622 7736
## [6] 0.6941203 1.111149 9220
rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules
## lhs rhs
## [1] {atype=Facility/Infrastructure Attack} => {trgt=0}
## [2] {wtype=Incendiary} => {trgt=0}
## [3] {ttype=Police} => {atype=Armed Assault}
## [4] {rgn=Middle East & North Africa} => {gname=Unknown}
## [5] {atype=Assassination} => {trgt=1}
## [6] {ttype=Police} => {trgt=1}
## support confidence lift count
## [1] 0.1253491 0.9521349 2.267297 7002
## [2] 0.1202649 0.9475317 2.256336 6718
## [3] 0.1259040 0.6421072 1.479341 7033
## [4] 0.1061941 0.5083555 1.370499 5932
## [5] 0.1754565 0.7378604 1.272048 9801
## [6] 0.1384891 0.7062905 1.217622 7736
What comes out of the previous analysis is that infrastructure attacks do not cause deaths, and neither do attacks with incendiary weapons. On the other side, assassinations (of course) and attacks against the police do bring deaths. The results are very interesting, but we want to investigate what happens with the "explosion" events, so we will fix them first as the right-hand side and then as the left-hand side. Let's have a look at the rules that imply a "Bombing/Explosion":
ExpBomItems = grep("atype=Bombing/Explosion", itemLabels(tData), value = TRUE)
rules <- apriori(data=tData, parameter=list(supp=0.1,conf = 0.5, minlen = 2), appearance=list(default="lhs",rhs=ExpBomItems), control = list (verbose=F))
rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules
## lhs rhs support confidence lift count
## [1] {wtype=Explosives,
## ttype=Private Citizens & Property} => {atype=Bombing/Explosion} 0.1370981 0.9736300 1.770019 17021
## [2] {rgn=Middle East & North Africa,
## wtype=Explosives,
## trgt=1} => {atype=Bombing/Explosion} 0.1019637 0.9598150 1.744904 12659
## [3] {rgn=Middle East & North Africa,
## wtype=Explosives,
## gname=Unknown} => {atype=Bombing/Explosion} 0.1366551 0.9589645 1.743357 16966
## [4] {rgn=Middle East & North Africa,
## wtype=Explosives} => {atype=Bombing/Explosion} 0.1951318 0.9539673 1.734273 24226
## [5] {wtype=Explosives,
## trgt=0} => {atype=Bombing/Explosion} 0.3572153 0.9524106 1.731443 44349
## [6] {wtype=Explosives,
## trgt=1,
## gname=Unknown} => {atype=Bombing/Explosion} 0.1075698 0.9503985 1.727785 13355
rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules
## lhs rhs support confidence lift count
## [1] {wtype=Explosives,
## ttype=Private Citizens & Property} => {atype=Bombing/Explosion} 0.1370981 0.9736300 1.770019 17021
## [2] {rgn=Middle East & North Africa,
## wtype=Explosives,
## trgt=1} => {atype=Bombing/Explosion} 0.1019637 0.9598150 1.744904 12659
## [3] {rgn=Middle East & North Africa,
## wtype=Explosives,
## gname=Unknown} => {atype=Bombing/Explosion} 0.1366551 0.9589645 1.743357 16966
## [4] {rgn=Middle East & North Africa,
## wtype=Explosives} => {atype=Bombing/Explosion} 0.1951318 0.9539673 1.734273 24226
## [5] {wtype=Explosives,
## trgt=0} => {atype=Bombing/Explosion} 0.3572153 0.9524106 1.731443 44349
## [6] {wtype=Explosives,
## trgt=1,
## gname=Unknown} => {atype=Bombing/Explosion} 0.1075698 0.9503985 1.727785 13355
It seems that the explosives weapon type in the Middle East and North Africa implies, with near-1 confidence, a Bombing/Explosion event. Let's see what happens after dropping the redundant rules.
#dropping redundant rules
subsetRules <- which(colSums(is.subset(rules, rules)) > 1) # get subset rules in vector
length(subsetRules)
## [1] 14
rules <- rules[-subsetRules] # remove subset rules.
rules_conf <- sort(rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules
## lhs rhs support confidence lift count
## [1] {wtype=Explosives} => {atype=Bombing/Explosion} 0.5426493 0.9439681 1.7160944 67371
## [2] {rgn=Middle East & North Africa} => {atype=Bombing/Explosion} 0.1962030 0.6761130 1.2291452 24359
## [3] {trgt=0} => {atype=Bombing/Explosion} 0.3622817 0.6572272 1.1948115 44978
## [4] {gname=Unknown} => {atype=Bombing/Explosion} 0.3009295 0.6432568 1.1694140 37361
## [5] {ttype=Private Citizens & Property} => {atype=Bombing/Explosion} 0.1382257 0.5371878 0.9765849 17161
## [6] {rgn=South Asia} => {atype=Bombing/Explosion} 0.1446292 0.5148526 0.9359805 17956
The results are a bit clearer: it is evident that events in the Middle East and North Africa are carried out by groups whose name is unknown.
Let's now see what a Bombing/Explosion attack implies:
rules <- apriori(data=tData, parameter=list(supp=0.1,conf = 0.5, minlen = 2), appearance=list(default="rhs",lhs=ExpBomItems), control = list (verbose=F))
rules_conf <- sort (rules, by="confidence", decreasing=TRUE) # 'high-confidence' rules.
inspect(head(rules_conf)) # show the support, lift and confidence for all rules
## lhs rhs support confidence
## [1] {atype=Bombing/Explosion} => {wtype=Explosives} 0.5426493 0.9865138
## [2] {atype=Bombing/Explosion} => {trgt=0} 0.3622817 0.6586130
## [3] {atype=Bombing/Explosion} => {gname=Unknown} 0.3009295 0.5470773
## lift count
## [1] 1.716094 67371
## [2] 1.194812 44978
## [3] 1.169414 37361
Strangely, with about 66% confidence, a bombing does not imply deaths (trgt=0).
rules_lift <- sort (rules, by="lift", decreasing=TRUE) # 'high-lift' rules.
inspect(head(rules_lift)) # show the support, lift and confidence for all rules
## lhs rhs support confidence
## [1] {atype=Bombing/Explosion} => {wtype=Explosives} 0.5426493 0.9865138
## [2] {atype=Bombing/Explosion} => {trgt=0} 0.3622817 0.6586130
## [3] {atype=Bombing/Explosion} => {gname=Unknown} 0.3009295 0.5470773
## lift count
## [1] 1.716094 67371
## [2] 1.194812 44978
## [3] 1.169414 37361
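Finally, since arulesViz was loaded above, the rule sets could also be examined graphically; a sketch with illustrative parameters:
# Sketch: scatter plot of the rules, plus a graph view of the highest-lift rules
plot(rules, measure = c("support", "confidence"), shading = "lift")
plot(head(rules_lift, n = 10), method = "graph")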
After the study, we make some assertions about the results obtained and give recommendations for future analyses of terrorism data.
Regarding the results of the different models, they predict the dependent variables well; however, they do not extract information useful enough to help security agencies and the population. We therefore recommend linking the existing data with other sources, such as countries' political information and news, which would make it possible to examine the tension associated with terrorism and the likelihood of an attack.
On the other hand, the clustering model yields interesting results on the types of attack and the characteristics of the various groups. Building on this, we recommend a segmentation by decade, which would allow observing the evolution of terrorism over time as well as the current characteristics of the attacks and of the terrorist groups operating today.
Finally, we highlight the value obtained from the analysis of association rules. With these techniques a great deal of information is obtained that is otherwise hard to see, given the large amount of data. However, this analysis was done at the end of the work, after the predictive models. We recommend doing this stage during the exploratory analysis and before building predictive models, since the association rules provide a good guide for data preprocessing and for focusing the segmentation and prediction models that can be built about terrorism.